Abstract: We propose a novel boosting approach to multi-class
classification problems, in which multiple classes are distin-
guished by a set of random projection matrices in essence. The 
approach uses random projections to alleviate the proliferation 
of binary classifiers typically required to perform multi-class 
classification. The result is a multi-class classifier with a single 
vector-valued parameter, irrespective of the number of classes 
involved. Two variants of this approach are proposed. The first 
method randomly projects the original data into new spaces, 
while the second method randomly projects the outputs of learned 
weak classifiers. These methods are not only conceptually simple 
but also effective and easy to implement. A series of experiments 
on synthetic, machine learning and visual recognition data sets 
demonstrate that our proposed methods compare favorably to 
existing multi-class boosting algorithms in terms of both the 
convergence rate and classification accuracy. 

Index Terms: Boosting, multi-class classification, randomization,
column generation, convex optimization.
I. Introduction

Multi-class classification has not only become an important 
tool in statistical data analysis, but also a critical factor in 
the progress that is being made towards solving some of 
the key problems in computer vision, such as generic object 
recognition. Applications of multi-class classification vary, but 
the objective in each is to assign the correct class label to each 
input test example, whether it be assigning the correct value 
to a handwritten digit, or the correct identity to a face. 

Boosting is a well-known machine learning algorithm which 
builds a strong ensemble classifier by combining weak learners 
which are in turn generated by a base learning oracle. The 
fact that a wide variety of weak learners can be employed 
makes the algorithm extremely flexible, yet it has been shown 
that boosting is robust and seems resistant to over-fitting 
in many cases [1]-[4]. A boosting classifier is made up
of a set of weak classification rules and a corresponding 
set of coefficients controlling the manner in which they are 
combined, and many multi-class variants have been proposed. 
Most of these algorithms reduce the multi-class classification 
problem to multiple binary-class problems and learn a coding 
matrix or a vector of coefficients for each class (e.g., [5]- 
[9]). The main justification for this reduction is the fact 
that binary classification problems are well studied and many 
effective algorithms have been carefully designed. In contrast 
to existing approaches, we propose to learn a single model 
with a single vector of coefficients that is independent of 
the number of classes. We achieve this by using random 
projections as the main tool. Random projections have been 
widely used as a dimensionality reduction technique in many 
areas, e.g., signal processing [10], machine learning [11], [12], 
information retrieval [13], data mining [14], face recognition 
[15]. The algorithm is based on the idea that any input feature
space can be embedded into a new lower dimensional space
without significantly losing the structure of the data or pairwise 
distances between instances. We choose random projections 
since we want to introduce diversity in the data space (either 
the original input data space or the weak classifiers' output 
space) for multi-class problems while preserving pairwise 
relationships. To our knowledge, this is the first time that 
random projections are used to simplify and implement multi- 
class boosting classification. 

Our main contributions are as follows. 

• We propose a new form of multi-class boosting which 
trains a single-vector parameterized classifier irrespective 
of the number of classes. We illustrate this new approach 
by incorporating random projections and pairwise con- 
straints into the boosting framework. 

• Two algorithms are proposed based on this high-level 
idea. The first algorithm randomly projects the original 
data into new spaces and the second algorithm randomly 
projects the outputs of selected weak classifiers. We 
then design multi-class boosting based on the column 
generation technique in convex optimization. 

The first algorithm is optimized in a stage-wise fashion, 
bearing resemblance to RankBoost [16] (and AdaBoost 
because of the equivalence of RankBoost and AdaBoost
[17]). The optimization procedure of our second method
is inspired by the totally corrective boosting framework 
[9], [18], although for the second approach, the mecha- 
nism for generating weak classifiers is entirely different 
from all conventional boosting methods. Our new design 
is not only conceptually simple, due to the reduced 
parameter space, but also effective as we empirically 
demonstrate on various data sets. 

• We theoretically justify the use of random projections by 
proving the margin separability in the proposed boosting. 
This theoretical analysis provides some insights in terms 
of the margin preservation and the minimal number of 
projected dimensions to guarantee margin separability. 

• We empirically show that both proposed methods perform 
well. We demonstrate some of the benefits of the pro- 
posed algorithms in a series of experiments. In terms of 
test error rates, our proposed methods perform at least as well
as many existing multi-class algorithms. We have made
the source code of the proposed boosting methods accessible
at http://code.google.com/p/boosting/downloads/.

Next we review the literature related to random projections 
and multi-class boosting. 

II. Related work 

Random projections have attracted much research interest 
from various scientific fields, e.g., signal processing [10], 
clustering [12], multimedia indexing and retrieval [19]-[21],
machine learning [13], [22], and computer vision [15], [23].
Random projections are a powerful method of dimensionality
reduction. The technique takes high-dimensional
data and maps it into a lower-dimensional space, while providing
some guarantee on the approximate preservation of distance.
Random projections have been successfully applied in
many research fields. One of the most widely used applications 
of random projections is sparse signal recovery. Candes and 
Tao show that the original signal can be reconstructed within 
very high accuracy from a small number of random mea- 
surements [10]. Random projections have also been applied 
in data mining as an efficient approximate nearest neighbour 
search algorithm. The search algorithm, known as locality 
sensitive hashing (LSH), approximates the cosine distance in 
the nearest neighbour problem [14]. The basic idea of LSH 
is to choose a random hyperplane and use it to hash input 
vectors to a single bit. Hash bits of two instances match with 
probability proportional to the cosine distance between two 
instances. Unlike traditional similarity search, LSH has been 
shown to work effectively and efficiently for large-scale high- 
dimensional data. 
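
The random-hyperplane hashing idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions: the helper names (`random_hyperplanes`, `hash_bits`, `estimated_angle`) are hypothetical, and the sketch relies on the standard property that two vectors separated by an angle θ disagree on a single random-hyperplane bit with probability θ/π.

```python
import math
import random

def random_hyperplanes(num_bits, dim, rng):
    """Draw one Gaussian normal vector per hash bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_bits(x, planes):
    """One bit per hyperplane: which side of the hyperplane x falls on."""
    return [1 if sum(p * v for p, v in zip(plane, x)) >= 0 else 0
            for plane in planes]

def estimated_angle(bits_a, bits_b):
    """The fraction of mismatching bits estimates angle / pi."""
    mismatches = sum(a != b for a, b in zip(bits_a, bits_b))
    return math.pi * mismatches / len(bits_a)

rng = random.Random(0)
planes = random_hyperplanes(2000, 3, rng)
a = [1.0, 0.0, 0.0]
b = [1.0, 1.0, 0.0]  # the true angle between a and b is pi / 4
angle = estimated_angle(hash_bits(a, planes), hash_bits(b, planes))
```

With a few thousand bits the estimated angle is typically close to the true one, so comparing short bit strings can stand in for comparing the original vectors.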

In machine learning, random projections have been applied 
to both supervised learning and unsupervised clustering problems.
Fern and Brodley show that random projections can be
used to improve the clustering result for high dimensional data
[12]. Bingham and Mannila compare random projections with
several dimensionality reduction methods on text and image 
data and conclude that the random lower dimension subspace 
yields results comparable to other conventional dimensionality 
reduction techniques with significantly less computation time 
[22]. Fradkin and Madigan explore random projections in a
supervised learning context [13]. They conclude that random 
projections offer clear computational advantage over principal 
component analysis while providing a comparable degree of 
accuracy. Thus far, we are not aware of any existing works 
which apply random projections to multi-class boosting. 

Boosting is a supervised learning algorithm which has 
attracted significant research attention over the past decade due 
to its effectiveness and efficiency. The first practical boosting 
algorithm, AdaBoost, was introduced for binary classification 
problems [24]. Since then, many subsequent works have been 
focusing on binary classification problems. Recently, however, 
several multi-class boosting algorithms have been proposed. 
Many of these algorithms convert multi-class problems into a 
set of binary classification problems. Here we loosely divide 
existing work on multi-class boosting into four categories. 

One-versus-all The simplest conversion is to reduce the 
problem of classifying k classes into k binary problems, where 
each problem discriminates a given class from the other $k - 1$
classes. Often k binary classifiers are used. For example, to
classify digit '0' from all other digits, one would train the
binary classifier with positive samples belonging to the digit
'0' and negative samples belonging to the other digits, i.e., '1',
..., '9'. During evaluation, the sample is assigned to the class
of the binary classifier with the highest confidence. Despite the 
simplicity of one-versus-all, Rifkin and Klautau have shown 
that one-versus-all can provide performance on par with that 
of more sophisticated multi-class classifiers [25]. An example 
of one-versus-all boosting is AdaBoost.MH [26]. 

All-versus-all In all-versus-all classifiers, the algorithm 
compares each class to all other classes. A binary classifier 
is built to discriminate between each pair of classes while 
discarding the rest of the classes. The algorithm thus builds 
$k(k-1)/2$ binary classifiers. During evaluation the class with the
maximum number of votes wins. Allwein et al. conclude that 
all-versus-all often has better generalization performance than 
one-versus-all algorithm [5]. The drawback of this algorithm 
is that the complexity grows quadratically with the number of 
classes. Thus it is not scalable in the number of classes. 

Error correcting output coding (ECOC) The above two 
algorithms are special cases of ECOC. The idea of ECOC 
is to associate each class with a codeword which is a row 
of a coding matrix $M \in \mathbb{R}^{k \times T}$ with $M_{ij} \in \{-1, 0, 1\}$. The
algorithm trains T binary classifiers to distinguish between k
different classes. During evaluation, the output of T binary
classifiers (a T-bit string) is compared to each codeword
and the sample is assigned to the class whose codeword has
the minimal Hamming distance. Dietterich and Bakiri report
improved generalization ability of this method over the above
two techniques [6]. In boosting, each binary classifier is viewed
as a weak learner and they are learned one at a time in sequence.
Some well-known ECOC based boosting algorithms are AdaBoost.MO,
AdaBoost.OC and AdaBoost.ECC [7], [8]. Although this technique
provides a simple solution to multi-class classification,
it does not fully exploit the pairwise correlations between 
classes. 
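
To make the decoding step concrete, the following Python sketch assigns a sample to the class whose codeword is closest in Hamming distance to the outputs of the T binary classifiers. The function names and the two small coding matrices are hypothetical, and skipping zero entries of a codeword (classes a classifier was not trained on) is our own simplifying assumption rather than a rule taken from [5]-[8].

```python
def hamming_distance(codeword, outputs):
    """Count disagreements; zero entries of the codeword are skipped (assumption)."""
    return sum(1 for c, o in zip(codeword, outputs) if c != 0 and c != o)

def ecoc_decode(coding_matrix, outputs):
    """Return the (0-based) class whose codeword is closest to the outputs."""
    distances = [hamming_distance(row, outputs) for row in coding_matrix]
    return min(range(len(distances)), key=distances.__getitem__)

# One-versus-all coding matrix for k = 3 classes (one column per class).
one_vs_all = [[+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]]

# All-versus-all coding matrix for k = 3; columns are (0 vs 1), (0 vs 2), (1 vs 2).
all_vs_all = [[+1, +1, 0],
              [-1, 0, +1],
              [0, -1, -1]]

label_ova = ecoc_decode(one_vs_all, [-1, +1, -1])
label_ava = ecoc_decode(all_vs_all, [+1, +1, -1])
```

Both special cases mentioned above differ only in the coding matrix that is passed in; the decoding routine itself is unchanged.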

Learning a matrix of coefficients in a single optimiza- 
tion problem One learns a linear ensemble for each class. 
Fig. 1. Flowchart illustration of RBoost$^{\text{rank}}$ and RBoost$^{\text{proj}}$.

Given a test example, the label is predicted by $\arg\max_r \sum_t W_{rt} h_t(x)$.
Each row of the matrix W corresponds to one
of the classes. The sample is assigned to the class whose row 
has the largest value of the weighted combination. To learn the 
matrix W, one can formulate the problem in the framework 
of multi-class maximum-margin learning. Shen and Hao show 
that the large-margin multi-class boosting can be implemented 
using column generation [27]. 

In contrast to previous works on multi-class boosting, we 
propose two novel boosting approaches that learn a single- 
vector parameterized ensemble classifier to distinguish be- 
tween classes. We achieve this through the use of random 
projections and pairwise constraints. To be specific, the first 
algorithm randomly projects each training datum to a new 
space. We then show that the multi-class learning problem can 
be reduced to a ranking problem. For the second algorithm, 
we train the multi-class boosting by randomly projecting the 
outputs of learned weak classifiers. 

Notation We use a bold lowercase letter, e.g., x, to denote
a column vector and a bold uppercase letter, e.g., P, to denote
a matrix. The ij-th entry of a matrix P is written as $P_{ij}$. $P_{i:}$
and $P_{:j}$ are the i-th row and j-th column of P, respectively.
Let $(x_i, y_i) \in \mathbb{R}^d \times \{1, 2, \cdots, k\}, i = 1, \cdots, m$ be a set of m
multi-class training samples where k is the number of classes.
Let T be the maximum number of boosting iterations and the
matrix, $H \in \mathbb{R}^{m \times T}$, denote the weak classifiers' response on
the training data. Each column $H_{:t}$ contains the output of the
t-th weak classifier $h_t(\cdot)$. Each row $H_{i:}$ contains the outputs
of all weak classifiers from the training instance $x_i$.

III. Our approach 

Many existing multi-class boosting algorithms learn a strong
classifier, and a corresponding set of weights, $w_r \in \mathbb{R}^T$,
for each class r. The two novel methods that we propose
here, however, learn a single vector of weights, $w \in \mathbb{R}^T$,
for all classes. Our approaches are conceptually simple and
easy to implement. We illustrate our new approaches by
incorporating random projections into the boosting framework.
Since random projections can be applied to either the original 
raw data or other intermediate results (for example, weak 
classifiers' outputs), we can formulate the multi-class problem 
as: 1) a pairwise ranking problem that is based on random 
projections of the original data; and 2) a maximum margin 
problem that is based on random projections of the weak 
classifiers' outputs. 
In our first approach, we generate a random Gaussian
matrix $P \in \mathbb{R}^{n \times d}$, whose entry $P(i,j)$ is $\frac{1}{\sqrt{n}} a_{ij}$ where the $a_{ij}$
are i.i.d. random variables from $N(0, 1)$. We multiply it with
each training instance, $x \in \mathbb{R}^{d \times 1}$, to obtain a projected data
vector, $Px \in \mathbb{R}^{n \times 1}$. The projected vector, $Px$, approximately
preserves all pairwise distances of input vector x, provided
that P consists of i.i.d. entries with zero mean and constant
variance [11]. In our second approach, we generate a random
Gaussian matrix, $P \in \mathbb{R}^{n \times T}$. We then multiply it with the
output of weak learners, $P[h_1(\cdot), h_2(\cdot), \cdots, h_T(\cdot)]^\top$, to obtain
a new weak learners' output space, $P H_{i:}^\top \in \mathbb{R}^{n \times 1}$. In short,
both algorithms rely on the use of random projections to 
obtain the single-vector parameterized classifier. However, in 
the first algorithm, the original raw data is randomly projected 
while in the second algorithm, the weak learners' outputs are 
randomly projected. A high-level flowchart that illustrates both 
approaches is shown in Fig. 1. 
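
As a minimal sketch of the projection step, the Python code below draws a Gaussian matrix, projects two vectors, and compares their distance before and after projection. The helper names are hypothetical, and the $1/\sqrt{n}$ scaling of the entries is an assumption made here so that distances are preserved in expectation; the same routine applies whether the projected vector is a raw instance or a vector of weak classifier outputs.

```python
import math
import random

def gaussian_projection(n, d, rng):
    """n x d matrix with i.i.d. N(0, 1) entries scaled by 1 / sqrt(n) (assumed scaling)."""
    scale = 1.0 / math.sqrt(n)
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

def project(P, x):
    """Matrix-vector product P x."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(0)
d, n = 50, 400
P = gaussian_projection(n, d, rng)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(d)]
x2 = [rng.uniform(-1.0, 1.0) for _ in range(d)]

# Ratio of projected to original distance; values near 1 mean the distance is preserved.
ratio = distance(project(P, x1), project(P, x2)) / distance(x1, x2)
```

In the multi-class setting one such matrix would be drawn per class, so that the same instance is seen under k different projections.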

A. Multi-class boosting by randomly projecting the original 
data 

We first formulate the multi-class learning problem as a 
pairwise ranking problem. The basic idea of our approach 
is to learn a multi-class classifier from the same instance
being projected using k different random projection matrices.
We create k pre-defined random projection matrices,
$P^{(1)}, P^{(2)}, \cdots, P^{(k)}$, one for each class, where the superscript
indicates the class label associated with the random projection
matrix. Given a training instance, $(x_i, y_i)$, the following condition,
$F(P^{(y_i)} x_i) > F(P^{(r)} x_i), \forall r \neq y_i$, has to be satisfied.$^1$
That is to say, the correct model's response must be larger than
all the incorrect models' responses. We can strengthen this by
requiring that the difference $F(P^{(y_i)} x_i) - F(P^{(r)} x_i)$ is as
large as possible. Motivated by the large margin principle,
we formulate our multi-class problem in the framework of 
maximum-margin learning. 

Learning using the exponential loss Given that we have 
m training samples with k classes, the total number of such 
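pairwise relations is $m(k - 1)$. Putting it into the large-margin
learning framework, we can define the margin associated with
the above condition as $F(P^{(y_i)} x_i) - F(P^{(r)} x_i)$, which can
be explicitly rewritten as,

$$\begin{aligned}
\rho_{ir} &= F(P^{(y_i)} x_i) - F(P^{(r)} x_i) \\
&= \sum_{t=1}^{T} h_t(P^{(y_i)} x_i)\, w_t - \sum_{t=1}^{T} h_t(P^{(r)} x_i)\, w_t \\
&= \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top w,
\end{aligned} \tag{1}$$

where $\delta h_t(P^{(y_i)}, P^{(r)}, x_i) = h_t(P^{(y_i)} x_i) - h_t(P^{(r)} x_i)$, and

$$\delta h(\cdot, \cdot, \cdot) = [\delta h_1(\cdot, \cdot, \cdot), \delta h_2(\cdot, \cdot, \cdot), \cdots, \delta h_T(\cdot, \cdot, \cdot)]^\top \in \mathbb{R}^{T \times 1}$$

is a column vector. The purpose is to learn a regularized model
that satisfies as many constraints, $\rho_{ir} > 0$, as possible. That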
is to say, we minimize the training error of the model with 
a controlled capacity. In theory, both of the two proposed 
boosting methods can employ any convex loss function. We 

$^1$For simplicity, we omit the model parameter w.



first show how to derive the boosting iteration using the 
exponential loss. Later we generalize it to any convex loss. 
With the exponential loss, the primal problem can be written
as,

$$\min_{w, \rho} \; \log\Big(\sum_{ir} \exp(-\rho_{ir})\Big) + \nu 1^\top w \tag{2}$$

s.t.: $\rho_{ir} = \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top w, \; \forall$ pair $(ir)$; $w \geq 0$,

where (ir) represents the joint index through all of the data 
and all of the classes. Taking the logarithm of the original cost 
function does not change the nature of the problem as log(-) is 
strictly monotonically increasing. This formulation is similar 
to the binary totally corrective boosting discussed in [9]. Also 
we have applied the $\ell_1$ norm regularization as in [9], [18] to
control the model complexity. 
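
The pairwise margins and the regularized log-sum-exp objective can be evaluated directly, as the following Python sketch shows. It is only an illustration under stated assumptions: classes are indexed from 0, weak learners are plain Python callables acting on a projected vector, and the function names and the toy data are hypothetical.

```python
import math

def mat_vec(P, x):
    """Matrix-vector product P x."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]

def pairwise_margins(weak_learners, w, projections, X, y):
    """rho[(i, r)] = sum_t w_t * (h_t(P^(y_i) x_i) - h_t(P^(r) x_i)) for every r != y_i."""
    rho = {}
    for i, (x, y_i) in enumerate(zip(X, y)):
        z_true = mat_vec(projections[y_i], x)
        for r, P in enumerate(projections):
            if r == y_i:
                continue
            z_r = mat_vec(P, x)
            rho[(i, r)] = sum(w_t * (h(z_true) - h(z_r))
                              for h, w_t in zip(weak_learners, w))
    return rho

def primal_objective(rho, w, nu):
    """Log-sum-exp of the negated margins plus the l1 penalty (w is non-negative)."""
    return math.log(sum(math.exp(-v) for v in rho.values())) + nu * sum(w)

# Toy problem: two classes, identity and coordinate-swap "projections" (hypothetical).
projections = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.0, 1.0], [1.0, 0.0]]]
X = [[1.0, -1.0], [-1.0, 1.0]]
y = [0, 1]
weak_learners = [lambda z: 1 if z[0] > 0 else -1]  # a single decision stump
w = [0.5]

rho = pairwise_margins(weak_learners, w, projections, X, y)
objective = primal_objective(rho, w, nu=0.1)
```

Here both training pairs have margin 1, so every pairwise constraint is satisfied and the objective is dominated by the regularization term and the number of pairs.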

If we can solve the optimization problem (2), the learned
model can be easily obtained. Unfortunately, the number of
weak learners is usually extremely large or even infinite,
which corresponds to an extremely large or infinite number
of variables w, so it is usually intractable to solve (2) directly.
Column generation can be used to approximately solve this
problem [9], [18]. We need to derive a meaningful Lagrange
dual problem such that column generation can be applied. The
Lagrangian is

$$L = \log\Big(\sum_{ir} \exp(-\rho_{ir})\Big) + \nu 1^\top w - u^\top \rho + \sum_{ir} u_{ir}\, \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top w - q^\top w,$$

with $q \geq 0$. The dual problem can be obtained as $\sup_u \inf_{w, \rho} L$.

The Lagrange dual problem can be derived as 

$$\min_u \; \sum_{ir} u_{ir} \log(u_{ir}) \tag{3}$$

s.t.: $\sum_{ir} u_{ir}\, \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top \leq \nu 1^\top$, $u \geq 0$, $1^\top u = 1$.

As is the case of AdaBoost [9], the dual is a Shannon 
entropy maximization problem. The objective function of the 
dual encourages the dual variables, u, to be uniform. The 
Karush-Kuhn-Tucker (KKT) optimality condition gives the
relationship between the optimal primal and dual variables:

$$u_{ir} = \frac{\exp(-\rho_{ir})}{\sum_{i'r'} \exp(-\rho_{i'r'})}. \tag{4}$$

The primal problem can be solved using an efficient Quasi- 
Newton method like L-BFGS-B, and the dual variables can 
be obtained using the KKT condition. From the dual, the 
subproblem for generating weak classifiers is, 

$$h^*(\cdot) = \arg\max_{h(\cdot)} \; \sum_{ir} u_{ir}\, \delta h(P^{(y_i)}, P^{(r)}, x_i). \tag{5}$$

This corresponds to finding the most violated constraint of the
dual problem (3). The details of our ranking based multi-class 
boosting algorithm are given in Algorithm 1. 
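
The two ingredients of one column generation step, namely the dual weights and the weak learner subproblem, can be sketched as follows. This Python sketch assumes that the dual weights are the normalised exponentials of the negated margins (so that they are non-negative and sum to one, as the dual constraints require) and that the weak learner is picked from a finite candidate list; the function names and the toy candidates are hypothetical.

```python
import math

def dual_weights(rho):
    """Normalised exp(-rho), computed with a shift for numerical stability."""
    shift = min(rho)
    exps = [math.exp(-(r - shift)) for r in rho]
    total = sum(exps)
    return [e / total for e in exps]

def select_weak_learner(candidates, u, pairs):
    """Pick the candidate maximising sum_j u_j * (h(z_true_j) - h(z_other_j))."""
    def score(h):
        return sum(u_j * (h(z_true) - h(z_other))
                   for u_j, (z_true, z_other) in zip(u, pairs))
    return max(candidates, key=score)

def stump(dim, threshold):
    """Decision stump on one coordinate of a projected vector."""
    return lambda z: 1 if z[dim] > threshold else -1

# Each pair holds the same instance projected with the correct class matrix
# and with a competing class matrix (toy values, hypothetical).
pairs = [([1.0, -1.0], [-1.0, 1.0]),
         ([2.0, 0.0], [0.0, 2.0])]
u = dual_weights([0.0, 0.0])
candidates = [stump(0, 0.0), stump(1, 0.0)]
best = select_weak_learner(candidates, u, pairs)
```

Pairs with small or negative margins receive large weights, so the selected weak learner concentrates on the pairwise constraints that the current ensemble handles worst.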

Stage-wise boosting The advantage of the algorithm out- 
lined above is that it is totally corrective in that the primal 
variables, w, are updated at each boosting iteration. However, 
the training of this approach can be expensive when the 
number of training data and classes are large. In this section, 
we design a more efficient approach by minimizing the loss 
function in a stage-wise manner, similar to those derived in 
AdaBoost. Looking at the primal problem (2), the optimal w 
can be calculated analytically as follows. At iteration t, we fix 
Algorithm 1 Column generation based RBoost$^{\text{rank}}$
Input:
  - A set of examples $(x_i, y_i)$, $i = 1, \cdots, m$;
  - The maximum number of weak classifiers, T;
  - Random projection matrices, $P^{(r)} \in \mathbb{R}^{n \times d}$, $r = 1, \cdots, k$.
Output: The learned multi-class classifier
  $F(x) = \arg\max_{r = 1, \cdots, k} \sum_{t=1}^{T} w_t h_t(P^{(r)} x)$.
Initialize:
  - $t \leftarrow 0$;
  - Initialize sample weights, $u_{ir} = \frac{1}{m(k-1)}$.
while $t < T$ do
  1) Train a weak learner, $h_t(\cdot)$, using (5);
  2) If the stopping criterion, $\sum_{ir} u_{ir}\, \delta h_t(P^{(y_i)}, P^{(r)}, x_i) < \nu + \epsilon$, has
     been met, stop the training;
  3) Add the best weak learner, $h_t(\cdot)$, obtained by solving (5), into the current set;
  4) Solve the primal problem (2), or the dual problem (3);
  5) If the primal problem is solved, update sample weights (dual variables)
     using (4);
  6) $t \leftarrow t + 1$;



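the values of $w_1, w_2, \cdots, w_{t-1}$. So $w_t$ is the only variable to
optimize. The primal cost function can then be written as$^2$

$$F_p = \sum_{ir} Q_{ir} \exp\big(-\delta h_t(P^{(y_i)}, P^{(r)}, x_i)\, w_t\big), \tag{6}$$

where $Q_{ir} = \exp\big(-\sum_{j=1}^{t-1} \delta h_j(P^{(y_i)}, P^{(r)}, x_i)\, w_j\big)$. If we
use discrete weak learners, $h(\cdot) \in \{-1, +1\}$, then $\delta h_t(\cdot) \in
\{-2, 0, 2\}$, and $F_p$ can be simplified into:

$$F_p = \sum_{\delta h_t = 0} Q_{ir} + \sum_{\delta h_t = 2} Q_{ir} \exp(-2 w_t) + \sum_{\delta h_t = -2} Q_{ir} \exp(2 w_t). \tag{7}$$

Let $Q_+ = \sum_{\delta h_t = 2} Q_{ir}$ and $Q_- = \sum_{\delta h_t = -2} Q_{ir}$, then $F_p$ is
minimized when

$$w_t = \frac{1}{4} \log\Big(\frac{Q_+}{Q_-}\Big). \tag{8}$$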



When real-valued weak learners are used, with the output in
$[-1, 1]$, we can calculate $w_t$ by minimizing the upper bound
of $F_p$ as follows,

$$F_p \leq \sum_{ir} Q_{ir} \big[0.5 \exp(w_t)\big(1 - \delta h_t(P^{(y_i)}, P^{(r)}, x_i)\big) + 0.5 \exp(-w_t)\big(1 + \delta h_t(P^{(y_i)}, P^{(r)}, x_i)\big)\big].$$

Here we have used the fact that $\exp(-wh) \leq 0.5 \exp(w)(1 - h) + 0.5 \exp(-w)(1 + h)$.
Similarly $F_p$ is minimized when

$$w_t = \frac{1}{2} \log\Big(\frac{1 + b}{1 - b}\Big), \tag{9}$$

where $b = \sum_{ir} Q_{ir}\, \delta h_t(P^{(y_i)}, P^{(r)}, x_i)$. For the stage-wise
boosting algorithm, we simply replace Step 4 in Algorithm 1
with (8) or (9). Note that the formulation of our stage-wise
boosting is similar to that of RankBoost proposed by
Freund et al. [16]. Besides the efficiency of optimization at 
each iteration, this stage-wise optimization does not have any 
parameter to tune. One only needs to determine when to stop. 
The disadvantage, compared with totally corrective boosting 
[9], is that it may need more iterations to converge. 
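
For discrete weak learners the stage-wise coefficient has a closed form, which the following Python sketch evaluates and checks numerically against the stage-wise cost. The sketch assumes the closed form $w_t = \frac{1}{4}\log(Q_+/Q_-)$ and requires $Q_- > 0$; the function names and the toy weights are hypothetical.

```python
import math

def stagewise_weight(Q, dh):
    """Closed-form coefficient for discrete weak learners; dh entries lie in {-2, 0, 2}."""
    q_plus = sum(q for q, d in zip(Q, dh) if d == 2)
    q_minus = sum(q for q, d in zip(Q, dh) if d == -2)
    # Assumes q_minus > 0, i.e. the weak learner gets at least one pair wrong.
    return 0.25 * math.log(q_plus / q_minus)

def stagewise_cost(Q, dh, w_t):
    """Stage-wise exponential cost: sum over pairs of Q * exp(-dh * w_t)."""
    return sum(q * math.exp(-d * w_t) for q, d in zip(Q, dh))

def update_pair_weights(Q, dh, w_t):
    """Fold the newly added weak learner into the pair weights for the next round."""
    return [q * math.exp(-d * w_t) for q, d in zip(Q, dh)]

Q = [0.4, 0.3, 0.2, 0.1]   # current pair weights (toy values)
dh = [2, 2, 0, -2]         # response differences of the new weak learner
w_t = stagewise_weight(Q, dh)
Q_next = update_pair_weights(Q, dh, w_t)
```

After the update, the pairs that the new weak learner ranks incorrectly carry more weight, which is the same reweighting behaviour as in AdaBoost and RankBoost.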

$^2$We can simply set $\nu$ to be zero in stage-wise boosting. Following the
framework of gradient-descent boosting of [28], [29], we can obtain the same 
formulation as described here. 



General convex loss The following derivations are based on 
the important concept of convex conjugate or Fenchel duality 
from convex optimization. 

Definition 1 (Convex Conjugate). Let $f : \mathbb{R}^n \to \mathbb{R}$. The
function $f^* : \mathbb{R}^n \to \mathbb{R}$, defined as

$$f^*(u) = \sup_{x \in \operatorname{dom} f} \big[u^\top x - f(x)\big], \tag{10}$$

is the convex conjugate or Fenchel duality of the function $f(\cdot)$.
The domain of the conjugate function consists of $u \in \mathbb{R}^n$ for
which the supremum is finite.

It is easy to verify that $f^*(\cdot)$ is always convex since it is
the point-wise supremum of a family of affine functions of u.
This holds even if $f(\cdot)$ is not a convex function.

If $f(\cdot)$ is convex and closed, then $f^{**} = f$. For a point-wise
loss function, $\lambda(\rho) = \sum_i \lambda(\rho_i)$, the convex conjugate of the
sum is the sum of the convex conjugates:

$$\lambda^*(u) = \sup_{\rho} \Big\{\sum_i u_i \rho_i - \sum_i \lambda(\rho_i)\Big\} = \sum_i \sup_{\rho_i} \big\{u_i \rho_i - \lambda(\rho_i)\big\} = \sum_i \lambda^*(u_i). \tag{11}$$

We consider functions of Legendre type here, which means,
the gradient $f'(\cdot)$ is defined on the domain of $f(\cdot)$ and is an
isomorphism between the domains of $f(\cdot)$ and $f^*(\cdot)$.

The general $\ell_1$-norm regularized optimization problem for
learning a classifier is

$$\min_{w, \rho} \; \sum_{ir} \lambda(\rho_{ir}) + \nu 1^\top w \tag{12}$$

s.t.: $\rho_{ir} = \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top w$, $w \geq 0$.

Here $\lambda(\cdot)$ is a convex surrogate of the zero-one loss, e.g., the
exponential loss or the logistic regression loss. We assume that $\lambda(\cdot)$
is smooth.

Although the variable of interest is w, we need to keep 
the auxiliary variable p in order to derive a meaningful dual 
problem. The Lagrangian is 

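$$\begin{aligned}
L &= \sum_{ir} \lambda(\rho_{ir}) + \nu 1^\top w - \sum_{ir} u_{ir}\big(\rho_{ir} - \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top w\big) - q^\top w \\
&= \Big[\nu 1^\top + \sum_{ir} u_{ir}\, \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top - q^\top\Big] w - \Big[\sum_{ir} u_{ir} \rho_{ir} - \sum_{ir} \lambda(\rho_{ir})\Big].
\end{aligned}$$

In order for L to have finite infimum over the primal variables,
the first term of L must be zero, which leads to

$$\nu 1^\top + \sum_{ir} u_{ir}\, \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top \geq 0. \tag{13}$$

The infimum of the second term of L is $-\sum_{ir} \lambda^*(u_{ir})$ by
using (10) and (11). Therefore the Lagrange dual problem of
(12) is

$$\max_u \; -\sum_{ir} \lambda^*(u_{ir}), \quad \text{s.t.: (13)}. \tag{14}$$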
We can reverse the sign of the dual variable u and rewrite
(14) into its equivalent form

$$\min_u \; \sum_{ir} \lambda^*(-u_{ir}) \tag{15}$$

s.t.: $\sum_{ir} u_{ir}\, \delta h(P^{(y_i)}, P^{(r)}, x_i)^\top \leq \nu 1^\top$.

The KKT condition between the primal (12) and the dual (15)
shows the relation of the primal and dual variables

$$u_{ir} = -\lambda'(\rho_{ir}), \tag{16}$$

which holds at optimality. The dual variable u is the negative
gradient of the loss at $\rho_{ir}$. This can be obtained by setting
the first derivative of L to zero. Under the assumption
that both the primal and dual problems are feasible and
Slater's condition is satisfied, strong duality holds between the
primal and dual problems.
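
The relation between margins and dual variables, namely that each dual variable is the negative derivative of the loss at the corresponding margin, is easy to evaluate for concrete losses. The Python sketch below does so for the per-pair exponential loss $\exp(-\rho)$ and the logistic loss $\log(1 + \exp(-\rho))$; the function names are hypothetical and no normalisation of the weights is applied.

```python
import math

def exp_loss_grad(rho):
    """Derivative of exp(-rho) with respect to rho."""
    return -math.exp(-rho)

def logistic_loss_grad(rho):
    """Derivative of log(1 + exp(-rho)) with respect to rho."""
    return -1.0 / (1.0 + math.exp(rho))

def dual_from_margins(margins, loss_grad):
    """Dual variable of each pair: the negative loss gradient at its margin."""
    return [-loss_grad(rho) for rho in margins]

margins = [-1.0, 0.0, 2.0]
u_exp = dual_from_margins(margins, exp_loss_grad)
u_logistic = dual_from_margins(margins, logistic_loss_grad)
```

For both losses the weights decrease as the margin grows, but the logistic weights are bounded by one, which is one way to see why the logistic loss is less sensitive to badly misclassified (possibly noisy) pairs.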

We need to use column generation to approximately solve 
the original problem because the dimension of the primal 
variable w can be extremely large or infinite. The high-level 
idea of column generation is to only consider a small subset 
of the variables in the primal; i.e., only a subset of w is 
considered. The problem solved using this subset is called 
the restricted master problem (RMP). It is well known that 
each primal variable corresponds to a constraint in the dual
problem. Solving RMP is equivalent to solving a relaxed 
version of the dual problem. With a finite w, the set of 
constraints in the dual problem are finite, and we can solve the 
dual problem such that it satisfies all the existing constraints. 
If we can prove that among all the constraints that we have not 
added to the dual problem, no single constraint is violated, then 
we can draw the conclusion that solving the restricted problem 
is equivalent to solving the original problem. Otherwise, there 
exists at least one constraint that is violated. The violated 
constraints correspond to variables in primal that are not in 
RMP. Adding these variables to RMP leads to a new RMP 
that needs to be re-optimized. To speed up convergence, one 
typically finds the most violated constraint in the dual by 
solving the following problem, according to the constraint in
(15):

$$\max_{h(\cdot)} \; \sum_{ir} u_{ir}\, \delta h(P^{(y_i)}, P^{(r)}, x_i). \tag{17}$$

We only need to change the primal and dual problems 
involved in Algorithm 1 to obtain the column generation 
based multi-class random boosting with a general convex 
loss function. Specifically, only two lines need a change in 
Algorithm 1 and the rest remains identical: 

Step 4: Solve the primal problem (12), or the dual problem
(15);

Step 5: If the primal problem is solved, update the dual
variable u using (16).

Note that the derivation of the dual problem (3) also 
follows the above analysis (using the fact that the convex 
conjugate of the log-sum-exp function is the Shannon entropy). 
Mathematically the convex conjugate of $f(x) = \log(\sum_i \exp x_i)$
is $f^*(u) = \sum_i u_i \log u_i$ if $u \geq 0$ and $\sum_i u_i = 1$; otherwise
$f^*(u) = \infty$.
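
The conjugate of the log-sum-exp function can be checked numerically. In the Python sketch below (function names hypothetical), the objective $u^\top x - f(x)$ is evaluated at $x = \log u$ for a point u on the probability simplex, where it equals the negative entropy, and at randomly drawn points, where it should never exceed that value.

```python
import math
import random

def log_sum_exp(x):
    """Numerically stable log(sum_i exp(x_i))."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def conjugate_objective(u, x):
    """The quantity u^T x - f(x) whose supremum over x defines f*(u)."""
    return sum(ui * xi for ui, xi in zip(u, x)) - log_sum_exp(x)

def negative_entropy(u):
    """sum_i u_i log u_i, with the convention 0 log 0 = 0."""
    return sum(ui * math.log(ui) for ui in u if ui > 0)

u = [0.5, 0.3, 0.2]                    # a point on the probability simplex
x_star = [math.log(ui) for ui in u]    # maximiser of the conjugate objective
best = conjugate_objective(u, x_star)

rng = random.Random(0)
samples = [[rng.uniform(-5.0, 5.0) for _ in u] for _ in range(1000)]
largest_sampled = max(conjugate_objective(u, x) for x in samples)
```

Because the objective is concave in x and its gradient vanishes at $x = \log u$, no sampled point can do better than the negative entropy.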



B. Multi-class boosting by randomly projecting weak classi- 
fiers' outputs 

In contrast to the approach proposed in the previous section, 
where we randomly project the original data to new spaces, 
we can also randomly project the output of weak classifiers, 
H, to new spaces. Our intuition is that if H is linearly 
separable then the randomly projected data, PH T , is likely 
to be linearly separable as well, as long as the random 
projection matrices satisfy some mild assumptions [13]. As in 
the previous approach, we learn a multi-class classifier based 
on pairwise comparisons. We create k pre-defined random 
projection matrices, $P^{(1)}, P^{(2)}, \cdots, P^{(k)}$, one for each class.
Given a training instance $(x_i, y_i)$ and the weak classifiers' responses,
$H_{i:}$, the condition $P^{(y_i)} H_{i:} w > P^{(r)} H_{i:} w, \forall r \neq y_i$
has to be satisfied. The intuition is the same as in the previous
case: the correct model's response should be larger than all
the incorrect models' responses. Note that in this approach,
$w \in \mathbb{R}^n$, i.e., it has a fixed size and is independent of the
number of boosting iterations (as compared to the previous
approach where the size of w is equal to the number of
boosting iterations, $w \in \mathbb{R}^T$). We define a margin associated
with the above condition as $\rho_{ir} = P^{(y_i)} H_{i:} w - P^{(r)} H_{i:} w$.
Now the margin has been defined, and the learning can be
solved within the large-margin framework, as described in
the previous section. We now apply the logistic loss due to
its robustness in handling noisy data [28]. Again, any other
convex surrogate loss can be used. Since the projected space,
$PH^\top$, can also be much larger than the original space, H,
we expect that some projected features might turn out to be
irrelevant. We also apply the $\ell_1$-norm regularization as in [9],
resulting in the following learning problem:

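$$\min_{w, \rho} \; \frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \log\big(1 + \exp(-\rho_{ir})\big) + \nu 1^\top w \tag{18}$$

s.t.: $\rho_{ir} = P^{(y_i)} H_{i:} w - P^{(r)} H_{i:} w, \; \forall i, \forall r$; $w \geq 0$.

Note that $w \geq 0$ enforces the non-negative constraint on w.
The Lagrangian of (18) can be written as

$$L = \frac{1}{mk} \sum_{i,r} \log\big(1 + \exp(-\rho_{ir})\big) + \nu 1^\top w - \sum_{i,r} u_{ir}\big(\rho_{ir} - P^{(y_i)} H_{i:} w + P^{(r)} H_{i:} w\big) - p^\top w, \tag{19}$$

with $p \geq 0$. At optimum, the first derivative of the
Lagrangian w.r.t. the primal variables, w, must be zero,

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; \sum_{i,r} u_{ir} \big(P^{(y_i)} - P^{(r)}\big) H_{i:} = p^\top - \nu 1^\top, \tag{20}$$

where $\delta P(y_i, r) = P^{(y_i)} - P^{(r)}$. By taking the infimum over
the primal variables, $\rho_{ir}$,

$$\frac{\partial L}{\partial \rho_{ir}} = 0 \;\Rightarrow\; \rho_{ir} = -\log\Big(\frac{-mk\, u_{ir}}{1 + mk\, u_{ir}}\Big), \tag{21}$$

and

$$\inf_{\rho_{ir}} L = \frac{1}{mk} \sum_{i,r} \big[-(1 + mk\, u_{ir}) \log(1 + mk\, u_{ir}) + mk\, u_{ir} \log(-mk\, u_{ir})\big]. \tag{22}$$

Reversing the sign of u, the Lagrange dual can be written as

$$\max_u \; -\frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \big[mk\, u_{ir} \log(mk\, u_{ir}) + (1 - mk\, u_{ir}) \log(1 - mk\, u_{ir})\big] \tag{23}$$

s.t.: $\sum_{i,r} u_{ir}\, \delta P(y_i, r) H_{i:} \leq \nu 1^\top$.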

Note that here the number of constraints is equal to the size 
of the new space (n). At each iteration, we choose the weak 
learner, h t (-), that most violates the dual constraint in (23). 
The subproblem of generating the weak classifier at iteration 
t can then be expressed as: 



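$$h^*(\cdot) = \arg\max_{h(\cdot) \in \mathcal{H},\, v} \; \sum_{i,r} u_{ir}\, [H_{i,1:t-1},\, h(x_i)]\, \delta p(y_i, r; v), \tag{24}$$

where $v = 1, \ldots, n$, $\forall h(\cdot) \in \mathcal{H}$ and
$\delta p(y_i, r; v) = \big[P^{(y_i)}_{v:} - P^{(r)}_{v:}\big]^\top$.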

Here a significant difference compared with conventional 
boosting is that the selection of the current best weak classifier 
depends on all previously selected weak classifiers. 

The idea behind our approach is that performance improves
as more weak classifiers, $h(\cdot)$, are added to the constraint.
This process can continue as long as there exists at least one
constraint that is violated. It stops when no constraint is violated, i.e.,

$$\max \Big(\sum_{i,r} u_{ir}\, \delta P(y_i, r) H_{i:}\Big) < \nu + \epsilon,$$

or when adding an additional weak classifier ceases to have
a significant impact on the objective value of (18), i.e.,
$\frac{\mathrm{Opt}_{t-1} - \mathrm{Opt}_t}{\mathrm{Opt}_{t-1}} < \epsilon$. In our experiments we use the latter as our
stopping criterion. Through the KKT optimality condition, the
gradient of Lagrangian (19) over primal variables, $\rho$, and dual
variables, u, must vanish at the optimum. The relationship
between the optimal value of $\rho$ and u can be expressed as

$$u_{ir} = \frac{\exp(-\rho_{ir})}{mk\big(1 + \exp(-\rho_{ir})\big)}. \tag{25}$$

The details of our random projection based multi-class boost- 
ing algorithm are given in Algorithm 2. 

The use of general convex loss here follows the similar 
generalization procedure as shown in the previous section. 
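
The quantities used by the second method can be sketched in Python as follows. This is an illustration under our own assumptions: each $P^{(r)}$ is taken to be an $n \times T$ matrix applied to the column vector of weak classifier outputs, the class score is the inner product of w with the projected vector, classes are indexed from 0, and the sample weights follow the logistic form $\exp(-\rho)/(mk(1 + \exp(-\rho)))$. All function names and the toy data are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mat_vec(P, h):
    return [dot(row, h) for row in P]

def class_scores(projections, h_x, w):
    """Score of class r: w^T (P^(r) h), with h the weak classifier outputs (assumed form)."""
    return [dot(w, mat_vec(P, h_x)) for P in projections]

def projected_margins(projections, H, y, w):
    """rho[(i, r)] = score of the true class minus score of class r, for all r."""
    rho = {}
    for i, (h_i, y_i) in enumerate(zip(H, y)):
        scores = class_scores(projections, h_i, w)
        for r, score in enumerate(scores):
            rho[(i, r)] = scores[y_i] - score
    return rho

def logistic_objective(rho, w, nu):
    """Average logistic loss over all (i, r) pairs plus the l1 penalty."""
    loss = sum(math.log1p(math.exp(-v)) for v in rho.values()) / len(rho)
    return loss + nu * sum(w)

def sample_weights(rho):
    """exp(-rho) / (mk (1 + exp(-rho))), written as 1 / (mk (1 + exp(rho)))."""
    mk = len(rho)
    return {key: 1.0 / (mk * (1.0 + math.exp(v))) for key, v in rho.items()}

def predict(projections, h_x, w):
    scores = class_scores(projections, h_x, w)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy setting: k = 2 classes, T = 2 weak classifiers, n = 2 projected dimensions.
projections = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.0, 1.0], [1.0, 0.0]]]
H = [[1.0, -1.0], [-1.0, 1.0]]   # weak classifier outputs, one row per instance
y = [0, 1]
w = [1.0, 0.0]

rho = projected_margins(projections, H, y, w)
u = sample_weights(rho)
label = predict(projections, H[0], w)
```

Note that the length of w equals the projected dimension n and does not change as weak classifiers are added; only the width of H and of the projection matrices in use grows.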

C. Computational complexity 

We analyze the complexity of our new approaches in this 
section. For the sake of completeness, we also analyze the 
computational complexity of training weak classifiers. For 
simplicity, we use a decision stump as our weak classifier. Note 
that any weak classifier algorithms can be applied here. For 
fast training of decision stumps, we first sort feature values and 



Algorithm 2 Column generation based RBoost$^{\text{proj}}$
Input:
  - A set of examples $(x_i, y_i)$, $i = 1, \cdots, m$;
  - The maximum number of weak classifiers, T;
  - Random projection matrices, $P^{(r)}$.
Output: A multi-class classifier
  $F(x) = \arg\max_r P^{(r)} [h_1(x), \cdots, h_T(x)]^\top w$.
Initialize:
  - $t \leftarrow 0$;
  - $H = \emptyset$;
  - Initialize sample weights, $u_{ir}$.
while $t < T$ do
  1) Train a weak learner, $h_t(\cdot)$, using (24);
  2) If the stopping criterion, $\frac{\mathrm{Opt}_{t-1} - \mathrm{Opt}_t}{\mathrm{Opt}_{t-1}} < \epsilon$, has been met, stop the
     training;
  3) Add the best weak learner, $h_t(\cdot)$, into the current set H;
  4) Solve the primal problem (18), e.g., using Quasi-Newton methods such
     as L-BFGS-B;
  5) Update sample weights (dual variables) using (25);
  6) $t \leftarrow t + 1$;



cache sorted results in memory. At each boosting iteration, all
decision stumps' thresholds will be searched and the optimal
decision stump, $h^*(\cdot)$, which satisfies (5) or (24), will be
saved as the weak learner for the t-th iteration. For RBoost^rank
(Algorithm 1), the total number of pairwise relationships is
$m(k - 1)$. We first sort features in each projected dimension.
This pre-processing step requires $O(nmk \log(mk))$ for
sorting n dimensions. At each iteration we then train decision stumps
for each projected dimension, which takes $O(nmk)$. The cost of
solving the optimization problem can simply be ignored since it
can be solved efficiently using (8) or (9). Let the maximum number
of iterations be T; the time complexity of the boosting iterations
is then $O(nmkT)$. The total time complexity for
RBoost^rank is $O(nmk \log(mk) + nmkT)$.

For RBoost^proj (Algorithm 2), the time required to sort
d features is $O(dm \log m)$. Step ① finds the optimal weak
learner that satisfies (24). The multiplication, $u_{ir} \, \delta P(y_i, r; v)$,
in (24) takes $O(nmk)$ for all n dimensions. Training decision
stumps requires $O(nmd)$. Hence, Step ① requires $O(nmk + nmd)$.
In Step ④ we solve for n variables at each iteration. Let us
assume the computational complexity of L-BFGS-B is roughly
cubic. Hence, the time complexity for T boosting iterations
is $O(nm(k + d)T + n^3 T)$ and the total time complexity
for RBoost^proj is $O(dm \log m + nm(k + d)T + n^3 T)$. The
computational complexity of both approaches is summarized
in Table I. Note that weak classifier training (learning decision
stumps) takes up most of the computation time for both
methods.
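The pre-sorting strategy described above can be illustrated with a small Python sketch. It is not the implementation used in the experiments: the function names are hypothetical, labels are assumed to be in {-1, +1}, and the stump is chosen to maximize the weighted edge $\sum_i w_i y_i h(x_i)$. Sorting is done once, and each boosting iteration then scans every feature in time linear in the number of samples.

```python
def presort(X):
    """For each feature, return the sample indices sorted by that feature's value."""
    m, d = len(X), len(X[0])
    return [sorted(range(m), key=lambda i, j=j: X[i][j]) for j in range(d)]


def best_stump(X, y, w, order):
    """Return (feature, threshold, polarity) maximizing sum_i w_i * y_i * h(x_i),
    where h(x) = polarity if x[feature] > threshold else -polarity."""
    total = sum(wi * yi for wi, yi in zip(w, y))
    # Start from the constant stump (threshold below every value).
    best = (abs(total), 0, float("-inf"), 1 if total >= 0 else -1)
    for j, idx in enumerate(order):
        left = 0.0  # running sum of w_i * y_i for samples at or below the threshold
        for pos in range(len(idx) - 1):
            i, nxt = idx[pos], idx[pos + 1]
            left += w[i] * y[i]
            if X[i][j] == X[nxt][j]:
                continue  # no threshold can separate equal feature values
            score = total - 2.0 * left  # edge of the stump with polarity +1
            if abs(score) > best[0]:
                threshold = 0.5 * (X[i][j] + X[nxt][j])
                best = (abs(score), j, threshold, 1 if score >= 0 else -1)
    return best[1:]
```

Because `presort` is called once and cached, only the sample weights `w` change between boosting iterations, which matches the separation between the pre-processing cost and the per-iteration cost in Table I.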

D. Discussion 

Advantage of applying random projections. One possible
advantage of applying random projections is that they
may further increase class separation on some data
sets. We illustrate this with the following toy example. We generate
an artificial data set with four diagonal distributions. Each
diagonal distribution is randomly generated from a multivariate
normal distribution with covariance [2.5, 1.5; 1.5, 1] and
one of the means [-3, 2], [-3, -2], [3, 4], [3, 0]. We train a one-versus-all
boosting classifier (with the decision stump as the weak learner) and plot
the decision boundary at 5, 100 and 1000 boosting iterations.
We also randomly project the artificial data to a new 2D
space and train the one-versus-all boosting classifier. Decision
WORKING PAPER





                                             RBoost^rank                   RBoost^proj
Pre-processing step for decision stumps      $O(nmk \log(mk))$             $O(dm \log m)$
At each iteration:
  Train the weak learner (decision stump)    $O(nmk)$                      $O(nmk + nmd)$
  Solve the optimization problem (RBoost)    $O(mk)$                       $O(n^3)$
Total computational complexity               $O(nmk \log(mk) + nmkT)$      $O(dm \log m + nm(k + d)T + n^3 T)$

TABLE I
Computational complexity of RBoost. m is the number of training samples, n is the number of projected dimensions, k is the
number of classes, d is the dimension size of the original data, and T is the number of iterations. Note that weak classifier
training (learning decision stumps) takes up most of the computation time for both methods.



[Figure 2 panels: Original 2D space (5 iterations); Original 2D space (100 iterations); Original 2D space (1000 iterations); Random subspace (5 iterations)]




Fig. 2. Decision boundaries on the artificial data set. First 3 columns: Classification of four diagonal distributions on the original two dimensional space at 
5, 100 and 1000 boosting iterations. Last column: Classification on randomly projected subspace (selectively chosen to illustrate a better separation between 
classes). 



boundaries of different examples are shown in Fig. 2. From 
the figure, classification on the randomly projected subspace 
(Fig. 2: last column) clearly indicates its advantage compared 
to classification on the original space. 
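The toy data set described above can be generated with a short Python sketch. The means and the covariance are taken from the text; everything else is an assumption made for illustration: the function names, the number of samples per class, the seeds, and the $1/\sqrt{2}$ scaling of the 2-by-2 Gaussian projection matrix.

```python
import math
import random

MEANS = [(-3.0, 2.0), (-3.0, -2.0), (3.0, 4.0), (3.0, 0.0)]
COV = ((2.5, 1.5), (1.5, 1.0))


def cholesky_2x2(c):
    """Lower-triangular factor L of a 2x2 covariance, so that L L^T = c."""
    l11 = math.sqrt(c[0][0])
    l21 = c[1][0] / l11
    l22 = math.sqrt(c[1][1] - l21 * l21)
    return l11, l21, l22


def sample_toy(per_class, seed=0):
    """Draw per_class points from each of the four correlated 2D Gaussians."""
    rng = random.Random(seed)
    l11, l21, l22 = cholesky_2x2(COV)
    X, y = [], []
    for label, (mx, my) in enumerate(MEANS):
        for _ in range(per_class):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            X.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
            y.append(label)
    return X, y


def random_project_2d(X, seed=1):
    """Project 2D points to a new 2D space with a random Gaussian matrix."""
    rng = random.Random(seed)
    P = [[rng.gauss(0.0, 1.0) / math.sqrt(2.0) for _ in range(2)] for _ in range(2)]
    return [(P[0][0] * a + P[0][1] * b, P[1][0] * a + P[1][1] * b) for a, b in X]
```

Axis-aligned decision stumps need many splits to follow the diagonal class boundaries in the original space. A random rotation and shear of the plane can, for some draws, align the classes more closely with the axes, which is the effect shown in the last column of Fig. 2.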

Theoretical justification based on margin analysis In 
this section we justify the use of random projections on 
the proposed single-model multi-class classifier. We begin by 
defining the margin on MultiBoost [27] and its bound when 
the weak classifiers' response, H, is randomly projected to the 
new space with a random projection matrix, P. 

Definition 2 (Multi-class Margin for Boosting). Given a
data set, $S = \{(x_i \in \mathbb{R}^d, y_i \in \mathcal{Y} = \{1, \cdots, k\})\}_{i=1}^m$, the weak
learners' response on the training data, H, and weak learners'
coefficients, $W = [w_1^\top, \cdots, w_k^\top]$, where $w_r \in \mathbb{R}^T$ is the weak
learners' coefficients for class r, the margin for boosting can
be defined as

$$\gamma = \min_{(x,y) \in S} \left( \frac{\langle w_y, H(x) \rangle}{\|w_y\| \|H(x)\|} - \max_{y' \neq y} \frac{\langle w_{y'}, H(x) \rangle}{\|w_{y'}\| \|H(x)\|} \right).$$
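As an illustration of Definition 2, the following Python sketch computes the multi-class margin from per-class coefficient vectors and per-sample weak learner responses. The function name and the data layout (lists of lists, classes indexed from 0) are assumptions made for the example.

```python
import math


def multiclass_margin(W, H, y):
    """Minimum over samples of cos(w_y, H(x)) - max_{y' != y} cos(w_{y'}, H(x)).

    W: list of k coefficient vectors, each of length T.
    H: list of m weak-learner response vectors, each of length T.
    y: list of m class labels in 0..k-1.
    """
    def cosine(a, b):
        dot = sum(p * q for p, q in zip(a, b))
        return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

    margins = []
    for h, label in zip(H, y):
        scores = [cosine(w, h) for w in W]
        rival = max(s for r, s in enumerate(scores) if r != label)
        margins.append(scores[label] - rival)
    return min(margins)
```

Because each term is a cosine, the margin lies in [-2, 2] and is positive exactly when every training sample is scored highest by its own class.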

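Theorem 1 (Margin Preservation). If the boosting has margin
$\gamma$, then for any $\delta, \epsilon \in (0, 1)$ and any

$$n \geq \frac{12}{3\epsilon^2 - 2\epsilon^3} \ln \frac{6km}{\delta},$$

with probability at least $1 - \delta$, the boosting associated with projected
weak learners' coefficients, $Pw_r, \forall r$, and the projected
weak learners' response, PH, has margin no less than

$$-\frac{1 + 3\epsilon}{1 - \epsilon^2} + \frac{\sqrt{1 - \epsilon^2}}{1 + \epsilon} + \frac{1 + \epsilon}{1 - \epsilon} \gamma.$$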

The above theorem shows that the multi-class margin can be 
well preserved after both weak learner's coefficients, W, and 



weak learners' responses, H, are randomly projected. This 
theorem justifies the use of random projection on MultiBoost 
[27]. The next theorem defines margin separability for the 
proposed single-vector parameterized multi-class boosting. 

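Theorem 2 (Single-vector Multi-class Boosting). Given
any random Gaussian matrix $R \in \mathbb{R}^{n \times kT}$, whose entry
$R(i,j) = \frac{1}{\sqrt{n}} a_{ij}$, where $a_{ij}$ are i.i.d. random variables from
$\mathcal{N}(0, 1)$. Denote $P_y \in \mathbb{R}^{n \times T}$ as the y-th submatrix of R, that
is $R = [P_1, \cdots, P_r, \cdots, P_k]$. If the boosting has margin $\gamma$,
then for any $\delta, \epsilon \in (0, 1]$ and any

$$n \geq \frac{12}{3\epsilon^2 - 2\epsilon^3} \ln \frac{6m(k - 1)}{\delta},$$

there exists a single-vector $v \in \mathbb{R}^n$, such that

$$\Pr\left( \frac{\langle v, P_y H(x) \rangle - \langle v, P_{y'} H(x) \rangle}{\|v\| \sqrt{\|P_y H(x)\|^2 + \|P_{y'} H(x)\|^2}} \geq \frac{-2\epsilon}{1 - \epsilon} + \frac{1 + \epsilon}{\sqrt{2k}\,(1 - \epsilon)} \gamma \right) \geq 1 - \delta, \quad \forall y' \neq y. \tag{26}$$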

The above theorem reveals that there exists a single-vector
$v \in \mathbb{R}^n$ under which the margin is preserved up to an order
of $O(\gamma / \sqrt{2k})$. In other words, the multi-class margin can be
well preserved after random projection as long as the
projected dimension, n, satisfies a mild condition. Not
only does the theorem justify the use of random projection to learn
the single-model classifier, it also shows that the required projected
dimension, n, grows only logarithmically with the number of
classes, k. This finding is important for problems where the
number of classes is large. Note that Theorem 2 only applies
to the second approach presented in this work.
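The lower bounds on n in Theorems 1 and 2 are easy to evaluate numerically. The following Python sketch does so; the function name and the example values of $\epsilon$, $\delta$ and m are hypothetical, and the bounds are worst-case guarantees rather than recommended settings.

```python
import math


def min_projected_dim(eps, delta, m, k, single_vector=True):
    """Smallest integer n satisfying the bound of Theorem 2 (single_vector=True)
    or Theorem 1 (single_vector=False)."""
    count = 6 * m * (k - 1) if single_vector else 6 * k * m
    return math.ceil(12.0 / (3 * eps ** 2 - 2 * eps ** 3) * math.log(count / delta))


# Illustrative values only: eps = 0.5, delta = 0.05, m = 1000 training samples.
n_10_classes = min_projected_dim(0.5, 0.05, 1000, 10)
n_100_classes = min_projected_dim(0.5, 0.05, 1000, 100)
n_1000_classes = min_projected_dim(0.5, 0.05, 1000, 1000)
```

Multiplying the number of classes by ten adds roughly the same constant to the required dimension each time, which is the logarithmic growth in k noted above.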
IV. Experiments

We evaluate our approaches on artificial, machine learning 
and visual recognition data sets and compare our approaches 
against existing multi-class boosting algorithms. For Ad- 
aBoost.ECC, we perform binary partitioning at each iteration 
using the random-half method [30]. Decision stumps are used 
as the weak classifier for all boosting algorithms. 

A. Toy data 

We first illustrate the behavior of our algorithms on artificial
multi-class data sets. We consider the problem of
discriminating various object classes on a 2D plane. For this
experiment the feature vectors are the xy-coordinates of the 2D
plane. We train 6 different classifiers using AdaBoost.ECC [7],
AdaBoost.MH [26], AdaBoost.MO [26], MultiBoost [27], and
our proposed RBoost^rank and RBoost^proj. For MultiBoost, we
use hinge loss and choose the regularization parameter from
$\{10^{-4}, 10^{-3}, 10^{-2}\}$. For both RBoost^rank and RBoost^proj,
we set n to be equal to 500. For RBoost^proj, we choose the
regularization parameter, $\nu$, from $\{10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}$.
In this experiment, we set the number of boosting
iterations to 500. Fig. 3 plots decision boundaries of various
methods. On two dimensional toy data sets, we observe that
decision boundaries of RBoost^rank are very similar to the true
decision boundary. This is not surprising since all toy data sets
are generated from the multivariate normal distribution. Hence,
RBoost^rank produces very accurate decision boundaries.

Size of the projected space. We use the three previous artificial
data sets and vary the size of the projected space, n. Each
data set is randomly split into two groups: 75% for training
and the rest for evaluation. We set the maximum number
of boosting iterations to 500. We vary n from 1000 to
10,000 for RBoost^rank and from 250 to 2000 for RBoost^proj. For
RBoost^rank, the larger the parameter n, the more features
the algorithm can choose from during training. From Theorem 1, as
long as n is approximately larger than $\log(mk)$, the margin is
preserved with high probability for RBoost^proj. Table II reports
final classification errors for various n. For RBoost^rank, we
observe a slight increase in generalization performance when
n increases. For RBoost^proj, as long as n is sufficiently large
($\geq 250$ in this experiment), the final performance is almost
not affected by the value of n.

B. Totally-corrective RBoost^rank and stage-wise RBoost^rank

In this experiment, we compare the performance of totally-corrective
RBoost^rank with stage-wise RBoost^rank. We use UCI
machine learning repository data sets and randomly split the
data sets into two groups: 75% of samples for training and the
rest for evaluation. We set the maximum number of boosting
iterations to 500. We conduct the experiment with two convex
losses: the exponential loss and the logistic loss. For totally-corrective
boosting, we solve the optimization problem
in Algorithm 1 using L-BFGS-B. For L-BFGS-B parameters,
we set the maximum number of iterations to 100, the accuracy
of the line search to $10^{-5}$, the convergence parameter to
terminate the program to $10^7 \cdot \epsilon$ (where $\epsilon$ is the machine
precision) and the number of corrections to approximate the
inverse Hessian matrix to 5. We use the same L-BFGS-B
parameters for all experiments. The regularization parameter
in (12), $\nu$, is determined by 5-fold cross validation. We
choose the best $\nu$ from $\{10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$.
For stage-wise RBoost^rank, we set n to be 100 times
the dimension size of the original data. All experiments are
repeated 10 times and the average and standard deviation
of test errors are reported in Table III. We observe that the
performance of both convex losses is comparable and that stage-wise
RBoost^rank produces comparable test accuracy to totally-corrective
RBoost^rank. However, stage-wise RBoost^rank has a
much lower CPU time. Since totally-corrective and stage-wise
RBoost^rank are comparable, we use stage-wise RBoost^rank
in the rest of our experiments.

C. UCI data sets 

The next experiment is conducted on both binary and multi-class
UCI machine learning repository and Statlog data sets^3.
For binary classification problems, we compare our approaches
with AdaBoost [24] while for multi-class problems, we compare
our approaches with AdaBoost.MH [26], AdaBoost.MO
[26], AdaBoost.ECC [7] and MultiBoost [27]. Each data set
is randomly split into two groups: 75% of samples for
training and the rest for evaluation. We set the maximum
number of boosting iterations to 1000. For AdaBoost.MH,
AdaBoost.MO and AdaBoost.ECC, the training stops when
the algorithm converges, e.g., when the weighted error of
weak classifiers is greater than 0.5. For MultiBoost, we use
the logistic loss and choose the regularization parameter from
$\{10^{-8}, 10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}\}$. For RBoost^rank, we set n to
be $2 \times 10^4$. For RBoost^proj, we set n to be equal to the number of
boosting iterations, i.e., 1000. Note that we have not carefully
tuned n in this experiment. The regularization parameter, $\nu$, is
determined by 5-fold cross validation. We choose the best $\nu$
from $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ for binary problems and from
$\{10^{-8}, 2.5 \times 10^{-8}, 5 \times 10^{-8}, 7.5 \times 10^{-8}, 10^{-7}, \cdots, 10^{-2}\}$
for multi-class problems. The training stops when adding
more weak classifiers does not further decrease the objective
function of (18). All experiments are repeated 10 times and
the mean and standard deviation of test errors are reported
in Tables IV and V. For binary classification problems, we
observe that all methods perform similarly. This indicates that
random projection based classifiers work well in practice.
This is not surprising since it can be shown easily that, for
two-class problems, RBoost^rank simply performs AdaBoost on
the randomly projected data [17]. By the theory of random
projections one would expect the performance of AdaBoost
trained using the data in the original space to be similar to that
of AdaBoost trained using the randomly projected data. For
multi-class problems, we observe that most methods perform
very similarly. However, RBoost^rank has a slightly better
generalization performance than other multi-class boosting
algorithms on 5 out of 11 data sets while RBoost^proj performs
slightly better than other algorithms on 3 out of 11 data sets.

3 For USPS and pendigits, we use 100 samples from each class. 











               RBoost^rank                                          RBoost^proj
Data set       n = 1000    2500         5000         10000         n = 250      500          1000         2000
Synthetic 1    3.4 (1.5)   2.6 (0.9)    2.1 (0.9)    1.8 (1.5)     11.2 (1.9)   13.0 (1.6)   11.2 (1.3)   10.9 (2.1)
Synthetic 2    0.6 (0.6)   0.6 (0.4)    0.2 (0.4)    0.5 (1.1)     7.0 (2.3)    8.2 (1.4)    6.6 (2.2)    7.0 (1.3)
Synthetic 3    3.7 (1.2)   3.3 (1.4)    4.0 (1.3)    3.5 (1.3)     6.5 (2.0)    6.9 (2.0)    6.4 (1.8)    7.1 (2.9)

TABLE II
Average test errors and standard deviations (shown in %) for different values of n. All experiments are repeated 5 times.



               Exponential loss (TC)      Logistic loss (TC)         Exponential loss (SW)
Data set       Test error    CPU time     Test error    CPU time     Test error    CPU time
australian     14.9 (2.5)    11.7         17.4 (2.6)    6.3          15.8 (2.1)    0.03
heart          19.9 (4.5)    6.5          22.7 (3.8)    1.9          21.3 (4.2)    0.02
wine           3.0 (2.4)     13.7         2.5 (2.3)     1.4          2.5 (2.7)     0.03
glass          34.9 (6.1)    10.6         30.4 (5.2)    4.8          31.3 (6.2)    0.03
segment        3.0 (0.6)     145          3.1 (0.7)     45.2         3.3 (0.6)     0.09

TABLE III
Average test errors (in %) and CPU time (seconds) (time taken to solve the optimization problem in the optimization step of
Algorithm 1). TC: totally-corrective RBoost^rank and SW: stage-wise RBoost^rank.



We then statistically compare both proposed approaches
using the nonparametric Wilcoxon signed-rank test (WSRT)
[31]. WSRT tests the median performance difference between
RBoost^proj and RBoost^rank. In this test, we set the significance
level to be 5%. The null hypothesis states that there is no
difference between the median performance of both algorithms,
i.e., both algorithms perform
equally well in a statistical sense. According to the table of
exact critical values for the Wilcoxon's test, for a significance
level of 0.05 and 11 data sets, the difference between the
classifiers is significant if the smaller of the rank sums is less
than or equal to 10. Since the signed rank statistic (16) is
greater than the critical value (10), WSRT fails
to reject the null hypothesis at the 5% significance level. In
other words, the test finds no statistically significant difference
between RBoost^proj and RBoost^rank.
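The signed-rank statistic used above can be computed with a few lines of Python. The sketch below is a generic illustration and not the script used for the reported value: the function name is hypothetical, tied absolute differences receive average ranks, and the ranks of zero differences are split evenly between the two sums, which is one common convention and is an assumption here. The error values in the usage lines are made up for illustration and are not taken from Table V.

```python
def wilcoxon_signed_rank(errors_a, errors_b):
    """Return the smaller of the positive and negative rank sums."""
    d = [a - b for a, b in zip(errors_a, errors_b)]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the absolute differences are tied
        # (exact float comparison; round the inputs first if needed).
        while end + 1 < len(order) and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0  # ranks are 1-based
        for t in range(pos, end + 1):
            ranks[order[t]] = average_rank
        pos = end + 1
    r_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    r_minus = sum(r for r, di in zip(ranks, d) if di < 0)
    zero_share = sum(r for r, di in zip(ranks, d) if di == 0) / 2.0
    return min(r_plus + zero_share, r_minus + zero_share)


# Made-up test errors of two classifiers on four data sets.
statistic = wilcoxon_signed_rank([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 5.5])
```

The statistic is then compared with the tabulated critical value for the given number of data sets (10 for 11 data sets at the 5% level, as stated above); the null hypothesis is rejected only if the statistic does not exceed that value.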

We further conduct an additional experiment on RBoost^rank
and RBoost^proj using a different weak classifier. An alternative
choice of weak classifiers for training boosting classifiers is
weighted Fisher linear discriminant analysis (WLDA) [32].
WLDA learns a linear projection function which ensures
good class separation between normally distributed samples
of two classes. The linear projection function is defined as
$(\Sigma_1 + \Sigma_2)^{-1} (\mu_1 - \mu_2)$ where $\mu_1$ and $\mu_2$ are the weighted class
means, and $\Sigma_1$ and $\Sigma_2$ are the weighted class covariance matrices
of the first and second class, respectively. In our experiment,
we project the weighted input data to a line using WLDA and
train the decision stump on the new 1D data [32]. Although
WLDA has a closed-form solution, computing the inverse
of the covariance matrix can be computationally expensive
when the size of covariance matrices is large, i.e., the time
complexity is cubic in the size of covariance matrices, which is
$O([\min(n, (m - 1)k)]^3)$. For RBoost^rank, it is computationally
infeasible to find the inverse of the covariance matrices when
n (n = 20,000) and $(m - 1)k$ are large. This is one of the
advantages of RBoost^proj compared with RBoost^rank.

So instead we randomly select 1000 dimensions from n at
each boosting iteration and then apply WLDA. We concatenate
the new WLDA feature to the n randomly projected features
and train RBoost^rank. The average classification error of both
approaches is shown in Table VI. 1) We observe that the
performance of both approaches often improves when we
apply a more discriminative WLDA as the weak learner,
compared with decision stumps. 2) Again, we statistically
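The WLDA projection $(\Sigma_1 + \Sigma_2)^{-1} (\mu_1 - \mu_2)$ can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, the covariances are normalized by the sum of the sample weights (an assumption), and the linear system is solved by Gaussian elimination instead of forming the inverse explicitly. The elimination is cubic in the number of features, which is the cost discussed above.

```python
def weighted_mean_cov(X, w):
    """Weighted mean and covariance of the rows of X (normalized by sum of weights)."""
    s = sum(w)
    d = len(X[0])
    mu = [sum(wi * x[j] for wi, x in zip(w, X)) / s for j in range(d)]
    cov = [[sum(wi * (x[a] - mu[a]) * (x[b] - mu[b]) for wi, x in zip(w, X)) / s
            for b in range(d)] for a in range(d)]
    return mu, cov


def solve(A, b):
    """Solve A p = b by Gaussian elimination with partial pivoting.
    Assumes A is non-singular; a singular matrix raises ZeroDivisionError."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    p = [0.0] * n
    for r in range(n - 1, -1, -1):
        p[r] = (M[r][n] - sum(M[r][j] * p[j] for j in range(r + 1, n))) / M[r][r]
    return p


def wlda_direction(X1, w1, X2, w2):
    """Projection direction (Sigma_1 + Sigma_2)^{-1} (mu_1 - mu_2)."""
    mu1, S1 = weighted_mean_cov(X1, w1)
    mu2, S2 = weighted_mean_cov(X2, w2)
    d = len(mu1)
    S = [[S1[a][b] + S2[a][b] for b in range(d)] for a in range(d)]
    return solve(S, [p - q for p, q in zip(mu1, mu2)])
```

Projecting each sample onto the returned direction gives the 1D feature on which the decision stump is then trained.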
              AdaBoost                                  RBoost^rank                               RBoost^proj
Data set      Test 50      Test 100     Test 1000     Test 50      Test 100     Test 1000     Test 50       Test 100      Test 1000
australian    14.8 (2.9)   14.8 (2.1)   16.6 (2.1)    15.3 (2.8)   15.7 (2.2)   16.9 (2.6)    14.2 (2.4)    14.2 (2.4)    14.2 (2.4)
b-cancer      4.3 (1.2)    4.4 (1.1)    4.6 (1.3)     4.6 (1.4)    4.2 (1.3)    4.3 (1.4)     3.9 (1.0)     4.0 (1.0)     4.1 (1.0)
c-cancer      20.0 (7.7)   18.7 (8.8)   16.0 (9.0)    16.7 (7.9)   15.3 (8.3)   16.0 (7.8)    23.3 (11.9)   23.3 (11.9)   23.3 (11.9)
diabetes      26.7 (2.1)   26.3 (3.0)   25.7 (2.9)    25.7 (1.5)   25.5 (1.3)   26.4 (2.6)    25.5 (2.2)    25.7 (2.1)    25.7 (2.1)
german        24.2 (2.3)   24.4 (2.3)   25.8 (3.0)    24.6 (2.5)   24.2 (2.9)   24.9 (3.1)    24.5 (1.6)    24.5 (1.6)    24.5 (1.6)
heart         16.7 (3.1)   17.6 (3.4)   20.9 (3.1)    16.9 (3.9)   16.7 (4.0)   17.6 (3.5)    19.6 (2.2)    19.9 (1.9)    19.9 (1.9)
ionosphere    11.7 (2.2)   11.6 (2.4)   10.0 (2.8)    9.4 (3.1)    7.5 (2.9)    7.4 (3.6)     12.2 (3.2)    11.9 (3.3)    12.0 (3.3)
liver         27.9 (6.3)   28.0 (4.6)   28.4 (3.7)    30.6 (4.7)   30.5 (4.6)   30.6 (3.6)    29.9 (5.6)    30.0 (5.4)    30.0 (5.4)
mushrooms     0.2 (0.1)    0.0 (0.0)    0.0 (0.0)     0.1 (0.1)    0.0 (0.0)    0.0 (0.0)     0.0 (0.0)     0.0 (0.0)     0.0 (0.0)
sonar         17.9 (4.3)   16.9 (3.8)   16.5 (2.9)    22.5 (5.4)   19.8 (6.1)   18.8 (4.7)    21.7 (4.9)    18.5 (4.0)    19.0 (4.5)
splice        8.7 (0.8)    8.4 (1.1)    8.7 (1.3)     16.5 (1.6)   15.1 (1.6)   11.3 (1.3)    8.3 (1.2)     8.2 (1.2)     8.2 (1.2)

TABLE IV
Average test errors and standard deviations (in %) of the proposed algorithms on two-class UCI data sets. All experiments are
repeated 10 times. Test errors at 50, 100 and 1000 boosting iterations are reported.



Data set                  AdaBoost.ECC   AdaBoost.MH   AdaBoost.MO   MultiBoost   RBoost^rank   RBoost^proj
dna (3 classes)           6.8 (0.9)      5.6 (1.2)     6.9 (1.2)     7.0 (0.9)    6.7 (0.9)     6.7 (0.9)
svmguide2 (3 classes)     23.2 (3.7)     21.7 (3.3)    22.9 (4.3)    22.1 (4.2)   19.8 (3.0)    21.1 (3.6)
wine (3 classes)          3.9 (3.0)      4.3 (3.8)     3.6 (3.7)     4.3 (3.5)    3.2 (2.9)     3.0 (3.0)
vehicle (4 classes)       21.0 (3.6)     21.6 (3.4)    21.3 (3.0)    21.8 (3.0)   20.0 (2.3)    22.1 (2.3)
glass (6 classes)         23.0 (3.8)     27.0 (3.6)    26.2 (6.8)    26.2 (5.5)   26.8 (4.5)    22.5 (4.2)
satimage (6 classes)      11.5 (0.7)     11.1 (1.1)    10.7 (1.0)    11.6 (0.9)   10.2 (0.5)    13.1 (0.8)
svmguide4 (6 classes)     15.9 (2.7)     17.5 (2.5)    17.9 (2.3)    19.0 (3.5)   17.6 (2.8)    17.4 (2.1)
segment (7 classes)       2.1 (0.5)      3.0 (0.5)     2.3 (0.5)     2.4 (0.7)    3.2 (0.8)     2.1 (0.3)
usps (10 classes)         9.2 (2.1)      9.2 (1.7)     8.8 (2.5)     10.0 (1.8)   8.8 (2.6)     9.1 (2.7)
pendigits (10 classes)    5.2 (0.8)      5.8 (0.9)     6.3 (1.4)     7.0 (1.4)    2.8 (0.9)     5.2 (0.9)
vowel (11 classes)        8.7 (2.5)      11.2 (2.3)    12.1 (3.0)    9.3 (2.8)    3.1 (1.3)     8.1 (2.2)

TABLE V
Average test errors (in %) of different algorithms on multi-class UCI data sets. All experiments are repeated 10 times and the
number of boosting iterations is set to 1000.



[Figure 4 panels: DNA (3 classes); WINE (3 classes); VEHICLE (4 classes); GLASS (6 classes); SATIMAGE (6 classes). Horizontal axis: number of iterations.]



Fig. 4. Average test error curves on multi-class UCI data sets. The vertical axis denotes the averaged test error rate and the horizontal axis denotes the 
number of boosting iterations. Best viewed in color. 



compare the performance of both proposed approaches (with
WLDA as the weak learner). Since the signed rank statistic
(10.5) is greater than the critical value (0), WSRT
fails to reject the null hypothesis at the 5%
significance level. In summary, the test again finds no
statistically significant difference between the two algorithms
when WLDA is used as the weak learner. Note
that other weak learners, e.g., LIBLINEAR and radial basis
function (RBF), may also be applied here.

We plot average test error curves of multi-class UCI data
sets in Fig. 4. Again, we can see that both of the proposed
methods perform similarly.



D. Handwritten digits data sets 

In this experiment, we vary the number of training samples
and compare the performance of different boosting algorithms.
We evaluate our algorithms on a popular handwritten digits
data set (MNIST) and a more difficult handwritten character
data set (TiCC) [33]. We first resize the original images to a
resolution of 28 x 28 pixels and apply a deslant technique,
similar to the one applied in [34]. We then extract 3 levels of
HOG features with 50% block overlapping (spatial pyramid
scheme) [35]. The block size in each level is 4 x 4, 7 x 7
Data set                  RBoost^rank (WLDA)   RBoost^proj (WLDA)
svmguide2 (3 classes)     18.9 (3.1)           19.7 (2.6)
wine (3 classes)          3.2 (2.9)            2.5 (3.1)
vehicle (4 classes)       19.6 (2.3)           18.4 (2.6)
glass (6 classes)         25.9 (4.0)           22.5 (4.2)
pendigits (10 classes)    2.8 (0.9)            4.0 (1.2)
vowel (11 classes)        2.6 (1.1)            4.5 (1.8)

TABLE VI
Average test errors (shown in %) with linear perceptron
classifiers trained by weighted linear discriminant analysis
(WLDA) as the weak learner.

[Figure 6 panels: Caltech-256 (Related classes); Caltech-256 (Mixed classes). Horizontal axis: number of iterations. Legend: Ada.MH, Ada.ECC, Ada.MO, RandomBoost (rank), RandomBoost (proj).]

[Figure 5 panels: TiCC (Handwritten digits); MNIST. Horizontal axis: number of training examples per class.]


Fig. 5. Average test errors on handwritten digits data sets by varying the 
amount of training samples per class, left: TiCC. right: MNIST. Best viewed 
in color. 

and 14 x 14 pixels, respectively. Extracted HOG features from 
all levels are concatenated. In total, there are 2,172 HOG
features. We perform dimensionality reduction using Principal 
Component Analysis (PCA) on training samples (similar to 
PCA-SIFT [36]). Our PCA projected data captures 90% of 
the original data variance. For RBoost^rank, we set n to be
$100k$. For RBoost^proj, we choose the best parameter from
$\{5 \times 10^{-8}, 10^{-7}, 5 \times 10^{-7}, 10^{-6}, 5 \times 10^{-6}, 10^{-5}\}$. For MNIST,

we randomly select 5, 10, 20 and 40 samples per class as training
sets and use the original test set of 10,000 samples. For
TiCC, we randomly select 5, 10, 20 and 40 samples from
each class as training sets and use 50 unseen samples from
each class as test sets. All experiments are repeated 5 times
(1000 boosting iterations) and the results are summarized in
Fig. 5. For handwritten digits, we observe that our algorithms
and AdaBoost.MO perform slightly better than AdaBoost.ECC
and AdaBoost.MH.

Note that AdaBoost.MO trains $2^{k-1} - 1$ weak classifiers
at each iteration, while both of our algorithms train 1 weak
classifier at each iteration. For example, on the MNIST digit data
set, the AdaBoost.MO model would have a total of 511,000
weak classifiers (1000 boosting iterations) while our multi-class
classifier would only consist of 1000 weak classifiers. In other
words, AdaBoost.MO is 511 times slower during performance
evaluation. We suspect that these additional weak classifiers
improve the generalization performance of AdaBoost.MO for
handwritten digits data sets, where there is a large variation
within the same class label.
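The count quoted above follows directly from the $2^{k-1} - 1$ weak classifiers trained per iteration, as the following small Python check shows. The function name is hypothetical.

```python
def adaboost_mo_weak_learners(k, iterations):
    """Weak classifiers per iteration (2^(k-1) - 1) and in total for AdaBoost.MO."""
    per_iteration = 2 ** (k - 1) - 1
    return per_iteration, per_iteration * iterations


# MNIST has k = 10 digit classes; the experiment uses 1000 boosting iterations.
per_iteration, total = adaboost_mo_weak_learners(10, 1000)
```

For ten classes this gives 511 weak classifiers per iteration and 511,000 in total, against 1000 for a model that adds a single weak classifier per iteration. The per-iteration count doubles with every additional class.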

E. Caltech-256 data sets 

We also evaluate our algorithms on a subset of Caltech- 
256. We consider two types of classes as experimented in 



Fig. 6. Average test error curves on Caltech-256 data sets, left: Related 
classes, right: Mixed classes. Best viewed in color. 

[37]: related classes^4 and mixed classes^5. We use the same
pre-computed features used in [38], i.e., PHOG, appearance,
region covariance and LBP. The data set is randomly split
into two groups: 25% for training and the rest for evaluation.
On average, there are 56 training samples per class for related
classes and 48 training samples per class for mixed classes. We
use the same setting as in the handwritten digits experiment.
The average test errors of 5 runs are reported in Fig. 6.
We see again that AdaBoost.MO converges faster. This is
not surprising as we previously mentioned that AdaBoost.MO
trains $2^{k-1} - 1$ weak classifiers at each iteration. As
AdaBoost.MO is not scalable to a large number of classes, it
is extremely slow during performance evaluation. Based on
our experiments, AdaBoost.MO requires approximately $2^{k-1}$
times as much execution time as other algorithms during test
time. The second observation is that our proposed methods
usually converge slightly faster than AdaBoost.MH and
AdaBoost.ECC.

V. Conclusion 

We have shown that, by exploiting random projections, it 
is possible to devise a single-vector parameterized boosting- 
based classifier, which is capable of performing multi-class 
classification. This approach represents a significant diver- 
gence from existing multi-class classification approaches, as 
neither the number of classifiers, nor the number of pa- 
rameters will grow as the number of classes increases. We 
have demonstrated two examples of the proposed approach, 
in the form of multi-class boosting algorithms, which solve 
the pairwise ranking problem and pairwise loss in the large 
margin framework. These algorithms are effective and can 
cope with both binary and multi-class classification problems 
as demonstrated on both synthetic and real world data sets. 

Our goal thus far has been to formulate a single-vector
multi-class boosting classifier, which demonstrates promising
results and alleviates the proliferation of parameters typically
faced in large-scale problems. Reducing the training time
required by both methods for large-scale problems is yet 
another challenging issue. Techniques, such as approximating 
the weak classifiers' threshold [39], approximating the weak 

4 bulldozer, firetruck, motorbikes, schoolbus, snowmobile and car-side. 
5 dog, horse, zebra, helicopter, fighter-jet, motorbikes, car-side, dolphin, 
goose and cactus. 
classifiers using FilterBoost [40] or incremental weak classifier
learning [41], offer an interesting approach towards this goal. 
An exploration on the effect of the size of random projection 
matrices on convergence and scaling could also be carried out. 

Appendix A 
Proof of Theorem 1 

Margin Preservation If the boosting has margin 7, then 
for any <5, e G (0, 1) and any 



n > 



12 , 6km 
In — — , 



3e 2 - 2e 3 

with probability at least 1 — 5, the boosting associated with 
projected weak learners' coefficients, ~Pw r ,\/r, and the pro- 
jected weak learners' response, PH, has margin no less than 

1 + 3e 71^2 1 



1-e 2 1 + 6 ' 1-e 7 ' 
Proof: By margin definition, for all (x, y) E S 
(w y , H(x)) (w y ,,H{x)) 

-LIlcl-X. 1 1 1 1 1 1 ______ t k , 1 ~~ j . 



IK||||H(s)|| v'*v \\w v ' II II H(x 
Take any single (x,y) € S, and let y = argmax [i^lfiTH^T 



(«0ll 

V 

and y — argmax r[p^_r~Tiw|[7^ > we nave by Lemma 1 

y" y " 

(substituting x in Lemma 1 by H(x) ) and union bound over 
all y' ^ y 



Pr 1 



(Pw,,, PH(aQ) 
IP^IHIPH^II 



> 1 



1 



KIII|H(aO|| 



>l-6exp(--(---)), 



Pr (Vy' ^ y, 
(P%,PH(z)) 

|p«ylll|PH(aO|| 



< 1 - 



Vl-e 2 



1+e 1+e 
1-e (t/y,H(a;)) 



1 + e ||uv||||H(> 
> 1 - 6(fc - 1) exp 



n e e . 
2 ' ^ ~ J' 



Appendix B 
Proof of Theorem 2 

Single-vector Multi-class Boosting Given any random 



> i.i.d. random variables from N(0, 1). T 
E W l ' T as the y-th submatrix of R, that is R = [Pi, 
P r , • • • , Pfc]. If the boosting has margin 7, then for any 



Gaussian matrix R g ]R" X , whose entry R(i,j) 

where a,j is i.i.d. random variables from N(0, 1). Denote 

n 

(5, e e (0, 1] and any 



n > 



12 



In 



6m(k — 1) 
3e 2 -2e 3l±1 S ' 

there exists a single-vector v E R", such that 
(«, P v H(sg)) - (u, Pj,»H(x)) 



Pr 



\\v\\^/\\P y U(xW 
2e 1 + e 



1 



2fe(l 



h ||P,<H(:z)|| 2 
-7) > 1 - 



> 



Vy' ^ y. (27) 



Proof: By the margin definition, there exists w, such that 
for all (H(x),y) e 5, 



(t«„,H(»)) (uy,H(x)) 



|H(x) 



to 



y'l 



|H(x)|| 



> 7,Vj/' 7^ y- 



Without losing generality, we assume w y has unit length, 
which can always be achieved by normalization, for all y. 
So now 

K.H(aO) - K-,H(x)) > T ||H(a B )||,Vv / £ y. 

This can be rewritten as 

(u, H(aj) <B) e y ) - (u, H(x) ® e y <) = 
(H(£c) ® e 9 - H(cc) ® e y /,u) > 7||H(x)||, 



where u is the concatenation of all w 



.J 



.J 



i.e. u = 



[w[ , w y , • • • UJfc] , the vector e y € K fc with 1 at the y-th 
dimension and zeros in others, and <g> is the tensor product. 
Define z x y > — H(x) <£> e y — H(x) <Ei e y '. 

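Applying Lemma 1 to $u$ and $z_{x,y,y'}$, we have, for a given $(x, y)$ and a fixed $y' \ne y$, with probability at least $1 - 6\exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right)$, the following holds:

$$\frac{\langle Pu, Pz_{x,y,y'}\rangle}{\|Pu\|\,\|Pz_{x,y,y'}\|} \ge \frac{-2\epsilon}{1-\epsilon} + \frac{1+\epsilon}{1-\epsilon} \cdot \frac{\langle H(x) \otimes e_y - H(x) \otimes e_{y'}, u\rangle}{\sqrt{2}\,\|u\|\,\|H(x)\|} \ge \frac{-2\epsilon}{1-\epsilon} + \frac{1+\epsilon}{\sqrt{2k}\,(1-\epsilon)}\,\gamma,$$

where we used $\|z_{x,y,y'}\| = \sqrt{2}\,\|H(x)\|$ and $\|u\| = \sqrt{k}$.

By the union bound over $m$ samples and $k - 1$ many $y'$,

$$\Pr\left( \exists (x,y) \in S, \exists y' \ne y: \frac{\langle Pu, Pz_{x,y,y'}\rangle}{\|Pu\|\,\|Pz_{x,y,y'}\|} < \frac{-2\epsilon}{1-\epsilon} + \frac{1+\epsilon}{\sqrt{2k}\,(1-\epsilon)}\,\gamma \right) \le 6m(k-1) \exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right).$$

Let $q = Pu$, we have

$$\langle Pu, Pz_{x,y,y'}\rangle = \langle q, P_y H(x) - P_{y'} H(x)\rangle.$$

Setting $\delta = 6m(k-1) \exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right)$ gives the bound on $n$. Thus

$$\Pr\left( \forall (x,y) \in S, \forall y' \ne y: \frac{\langle q, P_y H(x)\rangle - \langle q, P_{y'} H(x)\rangle}{\|q\| \sqrt{\|P_y H(x)\|^2 + \|P_{y'} H(x)\|^2}} \ge \frac{-2\epsilon}{1-\epsilon} + \frac{1+\epsilon}{\sqrt{2k}\,(1-\epsilon)}\,\gamma \right) \ge 1 - \delta,$$

which concludes the proof. ■

For the first of the two theorems, in which a single matrix $P$ projects both $w_y$ and $H(x)$, the proof concludes as follows. By the union bound again, with probability at least

$$1 - 6km \exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right),$$

for all $(x, y) \in S$, we have

$$\frac{\langle Pw_y, PH(x)\rangle}{\|Pw_y\|\,\|PH(x)\|} - \frac{\langle Pw_{y'}, PH(x)\rangle}{\|Pw_{y'}\|\,\|PH(x)\|}$$

$$\ge -\frac{1+3\epsilon}{1-\epsilon^2} + \frac{\sqrt{1-\epsilon^2}}{1+\epsilon} + \frac{1+\epsilon}{1-\epsilon} \cdot \frac{\langle w_y, H(x)\rangle}{\|w_y\|\,\|H(x)\|} - \frac{1-\epsilon}{1+\epsilon} \cdot \frac{\langle w_{y'}, H(x)\rangle}{\|w_{y'}\|\,\|H(x)\|}$$

$$\ge -\frac{1+3\epsilon}{1-\epsilon^2} + \frac{\sqrt{1-\epsilon^2}}{1+\epsilon} + \frac{1+\epsilon}{1-\epsilon}\,\gamma.$$

The first inequality applies the lower bound of Lemma 1 to the pair $(w_y, H(x))$ and the upper bound to the pair $(w_{y'}, H(x))$. The second uses the margin assumption together with $\frac{1-\epsilon}{1+\epsilon} \le \frac{1+\epsilon}{1-\epsilon}$ and $\langle w_{y'}, H(x)\rangle \ge 0$, the latter being required for Lemma 1 to apply.

Let $\delta = 6km \exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right)$, we have the desired lower bound on $n$. ■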
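The construction $q = Pu$ used in the proof of the multi-projection theorem can be tried out numerically. The sketch below is only an illustration under stated assumptions: the data are synthetic (class $y$ clusters around a unit vector $w_y$), $P$ is taken to be the block matrix $[P_1, \dots, P_k]$ so that $Pu = \sum_y P_y w_y$, and the names `single_vector_scores` and `demo` as well as all sizes are hypothetical. It classifies each example by the largest score $\langle q, P_y H(x)\rangle$.

```python
import numpy as np


def single_vector_scores(q, P_blocks, h):
    """Scores <q, P_y h> for every class y, given one feature vector h = H(x)."""
    return np.array([q @ (P_y @ h) for P_y in P_blocks])


def demo(k=5, d=20, n=2000, per_class=40, noise=0.05, seed=0):
    """Accuracy of the single-vector rule argmax_y <q, P_y H(x)> on synthetic data."""
    rng = np.random.default_rng(seed)
    # Assumed separable data: class y clusters around the unit vector w_y = e_y.
    W = np.eye(k, d)  # rows are w_1, ..., w_k, each of unit length
    labels = np.repeat(np.arange(k), per_class)
    H = W[labels] + noise * rng.standard_normal((labels.size, d))
    # One Gaussian block P_y per class, with entries a_ij / sqrt(n).
    P_blocks = rng.standard_normal((k, n, d)) / np.sqrt(n)
    # q = P u, where u concatenates the w_y; with P = [P_1, ..., P_k]
    # this is the sum of P_y w_y over all classes.
    q = sum(P_blocks[y] @ W[y] for y in range(k))
    preds = np.array([np.argmax(single_vector_scores(q, P_blocks, h)) for h in H])
    return float(np.mean(preds == labels))


if __name__ == "__main__":
    print("accuracy of the single-vector classifier:", demo())
```

A single vector $q \in \mathbb{R}^n$ serves all $k$ classes here; the class identity enters only through the projection $P_y$ applied to $H(x)$.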
The proofs of the above two theorems use the following two lemmas.



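Lemma 1. For any $w, x \in \mathbb{R}^d$, any random Gaussian matrix $P \in \mathbb{R}^{n \times d}$ whose entry $P(i,j) = \frac{1}{\sqrt{n}} a_{ij}$, where the $a_{ij}$ are i.i.d. random variables from $N(0, 1)$, let

$$\gamma = \frac{\langle w, x\rangle}{\|w\|\,\|x\|}.$$

For any $\epsilon \in (0, 1)$, if $\gamma \in (0, 1]$, then with probability at least

$$1 - 6\exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right),$$

the following holds:

$$\frac{1+\epsilon}{1-\epsilon}\,\gamma - \frac{2\epsilon}{1-\epsilon} \le \frac{\langle Pw, Px\rangle}{\|Pw\|\,\|Px\|} \le 1 - \frac{\sqrt{1-\epsilon^2}}{1+\epsilon} + \frac{\epsilon}{1+\epsilon} + \frac{1-\epsilon}{1+\epsilon}\,\gamma. \quad (28)$$

Proof: From Lemma 2 and the union bound, we know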



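$$(1-\epsilon) \le \frac{\|Px\|^2}{\|x\|^2} \le (1+\epsilon) \quad \text{and} \quad (1-\epsilon) \le \frac{\|Pw\|^2}{\|w\|^2} \le (1+\epsilon) \quad (29)$$

holds with probability at least $1 - 4\exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right)$. When (29) holds, due to the fact that increasing the length of two unit length vectors (i.e. from $\frac{Px}{\|Px\|}$ and $\frac{Pw}{\|Pw\|}$ to $\frac{Px}{\sqrt{1-\epsilon}\,\|x\|}$ and $\frac{Pw}{\sqrt{1-\epsilon}\,\|w\|}$) increases the norm of their difference$^6$, we have

$$\left\| \frac{Px}{\|Px\|} - \frac{Pw}{\|Pw\|} \right\|^2 \le \left\| \frac{Px}{\sqrt{1-\epsilon}\,\|x\|} - \frac{Pw}{\sqrt{1-\epsilon}\,\|w\|} \right\|^2. \quad (30)$$

It is easy to prove that

$$\left\| \frac{Px}{\|x\|} - \frac{Pw}{\|w\|} \right\|^2 \le \cdots \le (1+\epsilon) \left\| \frac{Px}{\|Px\|} - \frac{Pw}{\|Pw\|} \right\|^2 + \left( \sqrt{1+\epsilon} - \sqrt{1-\epsilon} \right)^2. \quad (31)$$

The first inequality is due to (29), the second inequality is due to the property of an acute angle.

Applying Lemma 2 to the vector $\frac{x}{\|x\|} - \frac{w}{\|w\|}$, we have

$$(1-\epsilon) \left\| \frac{x}{\|x\|} - \frac{w}{\|w\|} \right\|^2 \le \left\| \frac{Px}{\|x\|} - \frac{Pw}{\|w\|} \right\|^2 \le (1+\epsilon) \left\| \frac{x}{\|x\|} - \frac{w}{\|w\|} \right\|^2 \quad (32)$$

holds with the probability given by Lemma 2.

Let $\alpha$ be the angle between $w$ and $x$. We have

$$\frac{\langle w, x\rangle}{\|w\|\,\|x\|} = 1 - 2\sin^2\left(\frac{\alpha}{2}\right) = 1 - \frac{1}{2} \left\| \frac{x}{\|x\|} - \frac{w}{\|w\|} \right\|^2. \quad (33)$$

Similarly, for the angle between $Pw$ and $Px$,

$$\frac{\langle Pw, Px\rangle}{\|Pw\|\,\|Px\|} = 1 - \frac{1}{2} \left\| \frac{Px}{\|Px\|} - \frac{Pw}{\|Pw\|} \right\|^2. \quad (34)$$

Using (32), (30) and (31) we get that $\left\| \frac{Px}{\|Px\|} - \frac{Pw}{\|Pw\|} \right\|^2$ is bounded below and above by two terms involving $\left\| \frac{x}{\|x\|} - \frac{w}{\|w\|} \right\|^2$. Plugging (33) and (34) into the two side bounds, we get (28). Here we applied Lemma 2 to 3 vectors, namely $x$, $w$, and $\frac{x}{\|x\|} - \frac{w}{\|w\|}$; thus, by the union bound, the probability that the above holds is at least $1 - 6\exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right)$. ■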
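The angle-preservation statement of Lemma 1 can be checked empirically. The sketch below is a hedged illustration: it assumes the two-sided bound (28) in the form $\frac{1+\epsilon}{1-\epsilon}\gamma - \frac{2\epsilon}{1-\epsilon} \le \cos(Pw, Px) \le 1 - \frac{\sqrt{1-\epsilon^2}}{1+\epsilon} + \frac{\epsilon}{1+\epsilon} + \frac{1-\epsilon}{1+\epsilon}\gamma$, builds two vectors with a prescribed cosine $\gamma$, and counts how often the projected cosine leaves that interval. The function names and the parameter values are assumptions chosen for the example.

```python
import numpy as np


def lemma1_bounds(gamma: float, eps: float):
    """Lower and upper bounds on the projected cosine, as read from (28)."""
    lower = (1 + eps) / (1 - eps) * gamma - 2 * eps / (1 - eps)
    upper = (1 - np.sqrt(1 - eps**2) / (1 + eps)
             + eps / (1 + eps)
             + (1 - eps) / (1 + eps) * gamma)
    return lower, upper


def projected_cosines(gamma: float, n: int, d: int, trials: int, seed: int = 0):
    """Cosine between Pw and Px over independent draws of P, where cos(w, x) = gamma."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    w[0] = 1.0
    x = np.zeros(d)
    x[0], x[1] = gamma, np.sqrt(1.0 - gamma**2)  # unit vector at angle arccos(gamma) to w
    cosines = np.empty(trials)
    for t in range(trials):
        P = rng.standard_normal((n, d)) / np.sqrt(n)  # entries a_ij / sqrt(n)
        Pw, Px = P @ w, P @ x
        cosines[t] = (Pw @ Px) / (np.linalg.norm(Pw) * np.linalg.norm(Px))
    return cosines


if __name__ == "__main__":
    gamma, eps, n = 0.5, 0.3, 200
    lower, upper = lemma1_bounds(gamma, eps)
    cos = projected_cosines(gamma, n=n, d=20, trials=300)
    violations = np.mean((cos < lower) | (cos > upper))
    allowed = 6 * np.exp(-(n / 2) * (eps**2 / 2 - eps**3 / 3))
    print(f"bounds [{lower:.3f}, {upper:.3f}], violations {violations:.3f}, allowed {allowed:.3f}")
```

In experiments of this kind the interval is usually far from tight, which is consistent with the lemma being a worst-case guarantee.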

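Lemma 2. For any $x \in \mathbb{R}^d$, any random Gaussian matrix $P \in \mathbb{R}^{n \times d}$ whose entry $P(i,j) = \frac{1}{\sqrt{n}} a_{ij}$, where the $a_{ij}$ are i.i.d. random variables from $N(0, 1)$, for any $\epsilon \in (0, 1)$,

$$\Pr\left( (1-\epsilon) \le \frac{\|Px\|^2}{\|x\|^2} \le (1+\epsilon) \right) \ge 1 - 2\exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right).$$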
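Proof: Obviously, for any $w, x \in \mathbb{R}^d$ the following holds:

$$E(\langle Pw, Px\rangle) = \frac{1}{n} E\left( \sum_{\ell=1}^{n} \Big( \sum_{j=1}^{d} a_{\ell j} w_j \Big) \Big( \sum_{i=1}^{d} a_{\ell i} x_i \Big) \right) = \frac{1}{n} \sum_{\ell=1}^{n} \left( \sum_{j=1}^{d} E(a_{\ell j}^2)\, w_j x_j + \sum_{j \ne i} E(a_{\ell j})\, w_j\, E(a_{\ell i})\, x_i \right) = \langle w, x\rangle.$$

To obtain the above, we only used the fact that the $\{a_{ij}\}$ are independent with zero mean and unit variance.

Due to the 2-stability of the Gaussian distribution, we know

$$\sum_{j=1}^{d} a_{\ell j} w_j = \|w\|\, z_\ell, \qquad \sum_{j=1}^{d} a_{\ell j} x_j = \|x\|\, z'_\ell,$$

where $z_\ell$ and $z'_\ell \sim N(0, 1)$. We have $\langle Pw, Px\rangle = \frac{1}{n} \|w\|\,\|x\| \sum_{\ell=1}^{n} z_\ell z'_\ell$. If $w = x$, $\sum_{\ell=1}^{n} z_\ell^2$ is chi-square distributed with $n$ degrees of freedom. Applying the standard tail bound of the chi-square distribution, we have

$$\Pr\left( \langle Pw, Px\rangle \le (1-\epsilon) \langle w, x\rangle \right) \le \exp\left( \frac{n}{2} \left( 1 - (1-\epsilon) + \ln(1-\epsilon) \right) \right) \le \exp\left( -\frac{n\epsilon^2}{4} \right).$$

Here we used the inequality $\ln(1-\epsilon) \le -\epsilon - \epsilon^2/2$. Similarly, we have

$$\Pr\left( \langle Pw, Px\rangle \ge (1+\epsilon) \langle w, x\rangle \right) \le \exp\left( \frac{n}{2} \left( 1 - (1+\epsilon) + \ln(1+\epsilon) \right) \right) \le \exp\left( -\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right) \right).$$

Here we used the inequality $\ln(1+\epsilon) \le \epsilon - \epsilon^2/2 + \epsilon^3/3$. ■

$^6$ Note that the opposite does not hold in general.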
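Lemma 2 lends itself to a quick Monte Carlo check. The following minimal sketch draws many Gaussian matrices $P$ with entries $a_{ij}/\sqrt{n}$, measures how often $\|Px\|^2 / \|x\|^2$ leaves $[1-\epsilon, 1+\epsilon]$, and compares that frequency with the failure bound $2\exp\left(-\frac{n}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right)$. The function names, the test vector and the sizes are assumptions made for the illustration.

```python
import numpy as np


def lemma2_failure_bound(n: int, eps: float) -> float:
    """Failure probability 2 exp(-(n/2)(eps^2/2 - eps^3/3)) allowed by Lemma 2."""
    return float(2.0 * np.exp(-(n / 2.0) * (eps**2 / 2.0 - eps**3 / 3.0)))


def empirical_failure_rate(n: int, d: int, eps: float, trials: int, seed: int = 0) -> float:
    """Fraction of draws of P for which ||Px||^2 / ||x||^2 leaves [1 - eps, 1 + eps]."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)  # arbitrary fixed test vector
    failures = 0
    for _ in range(trials):
        P = rng.standard_normal((n, d)) / np.sqrt(n)  # entries a_ij / sqrt(n)
        ratio = np.sum((P @ x) ** 2) / np.sum(x**2)
        if not (1.0 - eps <= ratio <= 1.0 + eps):
            failures += 1
    return failures / trials


if __name__ == "__main__":
    n, d, eps = 200, 50, 0.3
    print("empirical failure rate:", empirical_failure_rate(n, d, eps, trials=300))
    print("Lemma 2 bound:         ", lemma2_failure_bound(n, eps))
```

Because the ratio is distributed as a chi-square variable with $n$ degrees of freedom divided by $n$, the result does not depend on the choice of $x$ or on $d$.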